From: Luka Bai <lukafocus@icloud.com>
To: linux-mm@kvack.org
Cc: Jonathan Corbet <corbet@lwn.net>,
Shuah Khan <skhan@linuxfoundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>, Zi Yan <ziy@nvidia.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
"Liam R. Howlett" <liam@infradead.org>,
Nico Pache <npache@redhat.com>,
Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
Barry Song <baohua@kernel.org>,
Lance Yang <lance.yang@linux.dev>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>, Jann Horn <jannh@google.com>,
Arnd Bergmann <arnd@arndb.de>, Kairui Song <kasong@tencent.com>,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-doc@vger.kernel.org, Luka Bai <lukabai@tencent.com>
Subject: [PATCH 5/5] mm: support choosing to do THP COW for anonymous pmd entry.
Date: Fri, 01 May 2026 13:55:46 +0800
Message-ID: <20260501-thp_cow-v1-5-005377483738@tencent.com>
In-Reply-To: <20260501-thp_cow-v1-0-005377483738@tencent.com>
From: Luka Bai <lukabai@tencent.com>
For PMD-mapped anonymous folios, we currently do not perform COW for the
whole VMA region, because we don't want to copy and unshare the full
PMD range on the first write fault.
That approach works well for most workloads; however, it also means the
PMD entry is split into 512 4K PTEs in the child process after a write
to part of the folio.
For example, if processes A and B share a PMD-sized folio and B writes
to a small region, B's PMD mapping is split into 511 4K PTEs that still
point to the original PMD-sized folio, plus one 4K PTE pointing to the
new 4K page.
This is quite good for memory utilization, but it also makes the TLB
gain from the PMD entry suddenly "vanish" after a simple write, which
causes an observable performance decrease in some workloads. It also
adds some "uncertainty" to THP, since the split happens transparently
in the COW scenario, which can cause trouble for users that need
stable hugepages.
This patch adds support for PMD-sized COW of anonymous pages,
controlled by a switch. The reason for the switch is that in some
scenarios performance matters more, while in other workloads the memory
waste is less bearable. The THP setup can therefore control this
behavior, either at the VMA level or the global level.
The patch is relatively simple: we add the function
wp_huge_pmd_page_copy to do the hugepage copy-on-write, performing the
allocation, accounting and cache flushing just as in the 4K path. We
use the newly reworked map_anon_folio_pmd_pf to do the mapping, since
it now properly supports FAULT_FLAG_UNSHARE.
We remove the ref check in do_huge_pmd_wp_page: now that copying the
PMD folio is supported, the subsequent folio_ref_count check determines
whether the folio can be used exclusively. If not, and THP COW is
enabled, we can always do copy-on-write for the folio, just as
do_wp_page does.
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
mm/huge_memory.c | 125 +++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 116 insertions(+), 9 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1e661b411b2e..a05a4456e5a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -40,6 +40,7 @@
#include <linux/pgalloc.h>
#include <linux/pgalloc_tag.h>
#include <linux/pagewalk.h>
+#include <linux/delayacct.h>
#include <asm/tlb.h>
#include "internal.h"
@@ -2196,6 +2197,94 @@ static vm_fault_t do_huge_zero_wp_pmd(struct vm_fault *vmf)
return ret;
}
+static vm_fault_t wp_huge_pmd_page_copy(struct vm_fault *vmf, struct folio *old_folio)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ struct folio *new_folio = NULL;
+ struct page *new_page, *old_page;
+ unsigned long pmd_address = vmf->address & HPAGE_PMD_MASK;
+ struct mmu_notifier_range range;
+ vm_fault_t ret = 0;
+ int i;
+
+ delayacct_wpcopy_start();
+
+ old_page = folio_page(old_folio, 0);
+ ret = vmf_anon_prepare(vmf);
+ if (unlikely(ret)) {
+ if (ret != VM_FAULT_RETRY)
+ ret = VM_FAULT_FALLBACK;
+ goto out;
+ }
+
+ new_folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
+ if (unlikely(!new_folio)) {
+ ret = VM_FAULT_FALLBACK;
+ goto out;
+ }
+
+ if (copy_user_large_folio(new_folio, old_folio,
+ pmd_address, vma)) {
+ ret = VM_FAULT_HWPOISON;
+ goto out;
+ }
+
+ new_page = folio_page(new_folio, 0);
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ kmsan_copy_page_meta(new_page + i, old_page + i);
+
+ __folio_mark_uptodate(new_folio);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+ pmd_address, pmd_address + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+
+ spin_lock(vmf->ptl);
+ if (unlikely(!pmd_same(pmdp_get(vmf->pmd), vmf->orig_pmd))) {
+ update_mmu_cache_pmd(vma, pmd_address, vmf->pmd);
+ ret = 0;
+ goto out_unlock;
+ }
+
+ flush_cache_range(vma, pmd_address, pmd_address + HPAGE_PMD_SIZE);
+ /*
+ * Clear the pmd entry and flush it first, before updating the
+ * pmd with the new entry, to keep TLBs on different CPUs in
+ * sync.
+ */
+ (void)pmdp_huge_clear_flush(vma, pmd_address, vmf->pmd);
+ /*
+ * We just temporarily decrement the mm_counter here, and it will be added back in
+ * map_anon_folio_pmd_pf below.
+ */
+ add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ map_anon_folio_pmd_pf(new_folio, vmf, true);
+ folio_remove_rmap_pmd(old_folio, old_page, vma);
+
+ spin_unlock(vmf->ptl);
+
+ mmu_notifier_invalidate_range_end(&range);
+ /* This put is for the folio_get() in the caller */
+ folio_put(old_folio);
+ free_swap_cache(old_folio);
+
+ /* This put is for decrementing refcount after we switch page table mapping */
+ folio_put(old_folio);
+
+ delayacct_wpcopy_end();
+ return 0;
+out_unlock:
+ spin_unlock(vmf->ptl);
+ mmu_notifier_invalidate_range_end(&range);
+out:
+ folio_put(old_folio);
+ if (new_folio)
+ folio_put(new_folio);
+
+ delayacct_wpcopy_end();
+ return ret;
+}
+
vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
{
const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
@@ -2204,12 +2293,13 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
struct page *page;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
pmd_t orig_pmd = vmf->orig_pmd;
+ vm_fault_t ret;
vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
VM_BUG_ON_VMA(!vma->anon_vma, vma);
if (is_huge_zero_pmd(orig_pmd)) {
- vm_fault_t ret = do_huge_zero_wp_pmd(vmf);
+ ret = do_huge_zero_wp_pmd(vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
@@ -2253,14 +2343,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
goto reuse;
}
- /*
- * See do_wp_page(): we can only reuse the folio exclusively if
- * there are no additional references. Note that we always drain
- * the LRU cache immediately after adding a THP.
- */
- if (folio_ref_count(folio) >
- 1 + folio_test_swapcache(folio) * folio_nr_pages(folio))
- goto unlock_fallback;
if (folio_test_swapcache(folio))
folio_free_swap(folio);
if (folio_ref_count(folio) == 1) {
@@ -2282,6 +2364,31 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
return 0;
}
+ /*
+ * Only do hugepage copy on write if the parameter setup supports it.
+ */
+ if (!hugepage_cow_enabled(vma))
+ goto unlock_fallback;
+
+ /*
+ * For vma without a vm_ops(anonymous vma), there should not be VM_SHARED or
+ * VM_MAYSHARE types.
+ */
+ VM_WARN_ON_ONCE_VMA(vma->vm_flags & (VM_SHARED | VM_MAYSHARE), vma);
+
+ folio_unlock(folio);
+ /*
+ * Copy on write branch here.
+ * We are about to unlock the ptl here, so we need to get folio before that
+ * in case the folio gets freed in the meantime.
+ */
+ folio_get(folio);
+ spin_unlock(vmf->ptl);
+ ret = wp_huge_pmd_page_copy(vmf, folio);
+ if (ret & VM_FAULT_FALLBACK)
+ goto fallback;
+ return ret;
+
unlock_fallback:
folio_unlock(folio);
spin_unlock(vmf->ptl);
--
2.52.0
Thread overview: 13+ messages
2026-05-01 5:55 [PATCH 0/5] mm: Support selecting doing direct COW for anonymous pmd entry Luka Bai
2026-05-01 5:55 ` [PATCH 1/5] mm: add basic madvise helpers and branch for THP setup Luka Bai
2026-05-01 5:55 ` [PATCH 2/5] mm: add pmd level THP COW parameter in sysfs Luka Bai
2026-05-01 5:55 ` [PATCH 3/5] mm: add pmd level THP COW judgement helpers Luka Bai
2026-05-01 5:55 ` [PATCH 4/5] mm: enable map_anon_folio_pmd_nopf to handle unshare Luka Bai
2026-05-01 5:55 ` Luka Bai [this message]
2026-05-01 7:11 ` [PATCH 5/5] mm: support choosing to do THP COW for anonymous pmd entry David Hildenbrand (Arm)
2026-05-01 15:01 ` Luka Bai
2026-05-01 7:07 ` [PATCH 0/5] mm: Support selecting doing direct " David Hildenbrand (Arm)
2026-05-01 16:16 ` Luka Bai
2026-05-01 18:30 ` David Hildenbrand (Arm)
2026-05-02 5:06 ` Luka Bai
2026-05-03 7:03 ` [syzbot ci] " syzbot ci