* [RFC PATCH 0/7] add mTHP support for wp
@ 2025-08-14 11:38 Vernon Yang
  2025-08-14 11:38 ` [RFC PATCH 1/7] mm: memory: replace single-operation with multi-operation in wp Vernon Yang
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Vernon Yang @ 2025-08-14 11:38 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, glider, elver, dvyukov,
	vbabka, rppt, surenb, mhocko, muchun.song, osalvador, shuah,
	richardcochran
  Cc: linux-mm, linux-kernel, Vernon Yang

Hi all,

This patchset makes the pagefault write-protect copy path support mTHP.
With this series, write-protect copy shows a 9-14% performance
improvement.

Anonymous page faults already support mTHP [1], so hardware features
such as arm64 contpte can map multiple PTEs with one TLB entry, reducing
the probability of TLB misses. However, once the process forks and CoW
is triggered again, this optimization is lost, because the CoW path only
allocates 4KB at a time.

Therefore, make the pagefault write-protect copy path support mTHP, so
the TLB optimization is preserved and CoW page faults become more
efficient.
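
As a purely illustrative sketch (not part of the series), the access
pattern this targets looks roughly like the following userspace program:
fault in an anonymous mapping, fork, then write from the child to
trigger CoW. It assumes a matching mTHP size (e.g. 64KB) is enabled in
sysfs:

	#include <string.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 16 * 64 * 1024;	/* several 64KB-sized chunks */
		char *buf;

		/* Anonymous private mapping; first touch may be served by mTHP. */
		buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return 1;
		memset(buf, 0x5a, len);		/* fault in the folios */

		if (fork() == 0) {
			/* Child writes hit write-protected PTEs and trigger CoW. */
			memset(buf, 0xa5, len);
			_exit(0);
		}
		wait(NULL);
		return 0;
	}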

vm-scalability usemem shows a clear improvement.
Test command: usemem -n 32 --prealloc --prefault 249062617
(results are in KB/s; higher is better)

|    size     | w/o patch | w/ patch  |  delta  |
|-------------|-----------|-----------|---------|
| baseline 4K | 723041.63 | 717643.21 | -0.75%  |
| mthp 16K    | 732871.14 | 799513.18 | +9.09%  |
| mthp 32K    | 746060.91 | 836261.83 | +12.09% |
| mthp 64K    | 747333.18 | 855570.43 | +14.48% |

This series is based on Linux v6.16 (038d61fd6422).

Thanks,
Vernon

[1] https://lore.kernel.org/all/20231207161211.2374093-1-ryan.roberts@arm.com/

Vernon Yang (7):
  mm: memory: replace single-operation with multi-operation in wp
  mm: memory: add ptep_clear_flush_range function
  mm: memory: add kmsan_copy_pages_meta function
  mm: memory: add offset to start copy for copy_user_gigantic_page
  mm: memory: improve wp_page_copy readability
  mm: memory: add mTHP support for wp
  selftests: mm: support wp mTHP collapse testing

 include/linux/huge_mm.h                 |   3 +
 include/linux/kmsan.h                   |  13 +-
 include/linux/mm.h                      |   8 +
 include/linux/pgtable.h                 |   3 +
 mm/hugetlb.c                            |   6 +-
 mm/kmsan/shadow.c                       |  26 +-
 mm/memory.c                             | 309 ++++++++++++++++++------
 mm/pgtable-generic.c                    |  20 ++
 tools/testing/selftests/mm/khugepaged.c |   5 +-
 9 files changed, 302 insertions(+), 91 deletions(-)

--
2.50.1




* [RFC PATCH 1/7] mm: memory: replace single-operation with multi-operation in wp
  2025-08-14 11:38 [RFC PATCH 0/7] add mTHP support for wp Vernon Yang
@ 2025-08-14 11:38 ` Vernon Yang
  2025-08-14 11:38 ` [RFC PATCH 2/7] mm: memory: add ptep_clear_flush_range function Vernon Yang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Vernon Yang @ 2025-08-14 11:38 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, glider, elver, dvyukov,
	vbabka, rppt, surenb, mhocko, muchun.song, osalvador, shuah,
	richardcochran
  Cc: linux-mm, linux-kernel, Vernon Yang

In preparation for mTHP support in the wp path, replace the single-page
operation functions with their multi-page counterparts. No functional
change.

Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
---
 include/linux/mm.h |  7 +++++++
 mm/memory.c        | 12 ++++++------
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa538feaa8d9..80c6673f419e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2589,6 +2589,13 @@ static inline void inc_mm_counter(struct mm_struct *mm, int member)
 	mm_trace_rss_stat(mm, member);
 }
 
+static inline void sub_mm_counter(struct mm_struct *mm, int member, long value)
+{
+	percpu_counter_sub(&mm->rss_stat[member], value);
+
+	mm_trace_rss_stat(mm, member);
+}
+
 static inline void dec_mm_counter(struct mm_struct *mm, int member)
 {
 	percpu_counter_dec(&mm->rss_stat[member]);
diff --git a/mm/memory.c b/mm/memory.c
index b0cda5aab398..a6bc1db22387 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3606,14 +3606,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
 		if (old_folio) {
 			if (!folio_test_anon(old_folio)) {
-				dec_mm_counter(mm, mm_counter_file(old_folio));
-				inc_mm_counter(mm, MM_ANONPAGES);
+				sub_mm_counter(mm, mm_counter_file(old_folio), 1);
+				add_mm_counter(mm, MM_ANONPAGES, 1);
 			}
 		} else {
 			ksm_might_unmap_zero_page(mm, vmf->orig_pte);
 			inc_mm_counter(mm, MM_ANONPAGES);
 		}
-		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
+		flush_cache_range(vma, vmf->address, vmf->address + PAGE_SIZE);
 		entry = folio_mk_pte(new_folio, vma->vm_page_prot);
 		entry = pte_sw_mkyoung(entry);
 		if (unlikely(unshare)) {
@@ -3636,7 +3636,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		folio_add_new_anon_rmap(new_folio, vma, vmf->address, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(new_folio, vma);
 		BUG_ON(unshare && pte_write(entry));
-		set_pte_at(mm, vmf->address, vmf->pte, entry);
+		set_ptes(mm, vmf->address, vmf->pte, entry, 1);
 		update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
 		if (old_folio) {
 			/*
@@ -3661,7 +3661,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			 * mapcount is visible. So transitively, TLBs to
 			 * old page will be flushed before it can be reused.
 			 */
-			folio_remove_rmap_pte(old_folio, vmf->page, vma);
+			folio_remove_rmap_ptes(old_folio, vmf->page, 1, vma);
 		}
 
 		/* Free the old page.. */
@@ -3676,7 +3676,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	mmu_notifier_invalidate_range_end(&range);
 
 	if (new_folio)
-		folio_put(new_folio);
+		folio_put_refs(new_folio, 1);
 	if (old_folio) {
 		if (page_copied)
 			free_swap_cache(old_folio);
-- 
2.50.1




* [RFC PATCH 2/7] mm: memory: add ptep_clear_flush_range function
  2025-08-14 11:38 [RFC PATCH 0/7] add mTHP support for wp Vernon Yang
  2025-08-14 11:38 ` [RFC PATCH 1/7] mm: memory: replace single-operation with multi-operation in wp Vernon Yang
@ 2025-08-14 11:38 ` Vernon Yang
  2025-08-14 11:38 ` [RFC PATCH 3/7] mm: memory: add kmsan_copy_pages_meta function Vernon Yang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Vernon Yang @ 2025-08-14 11:38 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, glider, elver, dvyukov,
	vbabka, rppt, surenb, mhocko, muchun.song, osalvador, shuah,
	richardcochran
  Cc: linux-mm, linux-kernel, Vernon Yang

In preparation for mTHP support in the wp path, add
ptep_clear_flush_range() to clear the PTEs in the specified range and
flush the corresponding TLB entries.

Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
---
 include/linux/pgtable.h |  3 +++
 mm/memory.c             |  2 +-
 mm/pgtable-generic.c    | 20 ++++++++++++++++++++
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 0b6e1f781d86..1ccddcd0098f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -826,6 +826,9 @@ static inline void clear_not_present_full_ptes(struct mm_struct *mm,
 extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
 			      unsigned long address,
 			      pte_t *ptep);
+extern void ptep_clear_flush_range(struct vm_area_struct *vma,
+				   unsigned long address,
+				   pte_t *ptep, unsigned int nr);
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_HUGE_CLEAR_FLUSH
diff --git a/mm/memory.c b/mm/memory.c
index a6bc1db22387..90cbed5ad150 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3632,7 +3632,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * that left a window where the new PTE could be loaded into
 		 * some TLBs while the old PTE remains in others.
 		 */
-		ptep_clear_flush(vma, vmf->address, vmf->pte);
+		ptep_clear_flush_range(vma, vmf->address, vmf->pte, 1);
 		folio_add_new_anon_rmap(new_folio, vma, vmf->address, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(new_folio, vma);
 		BUG_ON(unshare && pte_write(entry));
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 5a882f2b10f9..cdffec4f54d9 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -101,6 +101,26 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
 		flush_tlb_page(vma, address);
 	return pte;
 }
+
+void ptep_clear_flush_range(struct vm_area_struct *vma, unsigned long address,
+			    pte_t *ptep, unsigned int nr)
+{
+	struct mm_struct *mm = (vma)->vm_mm;
+	bool accessible = false;
+	pte_t pte;
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		pte = ptep_get_and_clear(mm, address + i * PAGE_SIZE, ptep + i);
+		if (!accessible && pte_accessible(mm, pte))
+			accessible = true;
+	}
+
+	if (accessible) {
+		flush_tlb_mm_range(vma->vm_mm, address, address + nr * PAGE_SIZE,
+				   PAGE_SHIFT, false);
+	}
+}
 #endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-- 
2.50.1




* [RFC PATCH 3/7] mm: memory: add kmsan_copy_pages_meta function
  2025-08-14 11:38 [RFC PATCH 0/7] add mTHP support for wp Vernon Yang
  2025-08-14 11:38 ` [RFC PATCH 1/7] mm: memory: replace single-operation with multi-operation in wp Vernon Yang
  2025-08-14 11:38 ` [RFC PATCH 2/7] mm: memory: add ptep_clear_flush_range function Vernon Yang
@ 2025-08-14 11:38 ` Vernon Yang
  2025-08-14 11:38 ` [RFC PATCH 4/7] mm: memory: add offset to start copy for copy_user_gigantic_page Vernon Yang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Vernon Yang @ 2025-08-14 11:38 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, glider, elver, dvyukov,
	vbabka, rppt, surenb, mhocko, muchun.song, osalvador, shuah,
	richardcochran
  Cc: linux-mm, linux-kernel, Vernon Yang

In preparation for mTHP support in the wp path, add
kmsan_copy_pages_meta() to copy KMSAN metadata for multiple pages from
the source to the destination.

Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
---
 include/linux/kmsan.h | 13 ++++++++++---
 mm/kmsan/shadow.c     | 26 +++++++++++++++++++-------
 mm/memory.c           |  2 +-
 3 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h
index 2b1432cc16d5..a3f227c3947f 100644
--- a/include/linux/kmsan.h
+++ b/include/linux/kmsan.h
@@ -78,15 +78,16 @@ void kmsan_alloc_page(struct page *page, unsigned int order, gfp_t flags);
 void kmsan_free_page(struct page *page, unsigned int order);
 
 /**
- * kmsan_copy_page_meta() - Copy KMSAN metadata between two pages.
+ * kmsan_copy_pages_meta() - Copy KMSAN metadata between two pages.
  * @dst: destination page.
  * @src: source page.
+ * @nr_pages: copy number of page.
  *
  * KMSAN copies the contents of metadata pages for @src into the metadata pages
  * for @dst. If @dst has no associated metadata pages, nothing happens.
  * If @src has no associated metadata pages, @dst metadata pages are unpoisoned.
  */
-void kmsan_copy_page_meta(struct page *dst, struct page *src);
+void kmsan_copy_pages_meta(struct page *dst, struct page *src, int nr_pages);
 
 /**
  * kmsan_slab_alloc() - Notify KMSAN about a slab allocation.
@@ -324,7 +325,8 @@ static inline void kmsan_free_page(struct page *page, unsigned int order)
 {
 }
 
-static inline void kmsan_copy_page_meta(struct page *dst, struct page *src)
+static inline void kmsan_copy_pages_meta(struct page *dst, struct page *src,
+					int nr_pages)
 {
 }
 
@@ -407,4 +409,9 @@ static inline void *memset_no_sanitize_memory(void *s, int c, size_t n)
 
 #endif
 
+static inline void kmsan_copy_page_meta(struct page *dst, struct page *src)
+{
+	kmsan_copy_pages_meta(dst, src, 1);
+}
+
 #endif /* _LINUX_KMSAN_H */
diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
index 54f3c3c962f0..1dd0f7a1eb5f 100644
--- a/mm/kmsan/shadow.c
+++ b/mm/kmsan/shadow.c
@@ -148,24 +148,36 @@ void *kmsan_get_metadata(void *address, bool is_origin)
 	return (is_origin ? origin_ptr_for(page) : shadow_ptr_for(page)) + off;
 }
 
-void kmsan_copy_page_meta(struct page *dst, struct page *src)
+
+void kmsan_copy_pages_meta(struct page *dst, struct page *src, int nr_pages)
 {
+	int i;
+
 	if (!kmsan_enabled || kmsan_in_runtime())
 		return;
-	if (!dst || !page_has_metadata(dst))
+
+	for (i = 0; i < nr_pages; i++) {
+		if (!dst || !page_has_metadata(dst))
+			break;
+		if (!src || !page_has_metadata(src))
+			break;
+	}
+
+	if (i == 0 && !dst) {
 		return;
-	if (!src || !page_has_metadata(src)) {
-		kmsan_internal_unpoison_memory(page_address(dst), PAGE_SIZE,
+	} else if (i < nr_pages) {
+		kmsan_internal_unpoison_memory(page_address(dst),
+					       nr_pages * PAGE_SIZE,
 					       /*checked*/ false);
 		return;
 	}
 
 	kmsan_enter_runtime();
-	__memcpy(shadow_ptr_for(dst), shadow_ptr_for(src), PAGE_SIZE);
-	__memcpy(origin_ptr_for(dst), origin_ptr_for(src), PAGE_SIZE);
+	__memcpy(shadow_ptr_for(dst), shadow_ptr_for(src), nr_pages * PAGE_SIZE);
+	__memcpy(origin_ptr_for(dst), origin_ptr_for(src), nr_pages * PAGE_SIZE);
 	kmsan_leave_runtime();
 }
-EXPORT_SYMBOL(kmsan_copy_page_meta);
+EXPORT_SYMBOL(kmsan_copy_pages_meta);
 
 void kmsan_alloc_page(struct page *page, unsigned int order, gfp_t flags)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 90cbed5ad150..7b8c7d0f9ff4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3589,7 +3589,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			delayacct_wpcopy_end();
 			return err == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
 		}
-		kmsan_copy_page_meta(&new_folio->page, vmf->page);
+		kmsan_copy_pages_meta(&new_folio->page, vmf->page, 1);
 	}
 
 	__folio_mark_uptodate(new_folio);
-- 
2.50.1




* [RFC PATCH 4/7] mm: memory: add offset to start copy for copy_user_gigantic_page
  2025-08-14 11:38 [RFC PATCH 0/7] add mTHP support for wp Vernon Yang
                   ` (2 preceding siblings ...)
  2025-08-14 11:38 ` [RFC PATCH 3/7] mm: memory: add kmsan_copy_pages_meta function Vernon Yang
@ 2025-08-14 11:38 ` Vernon Yang
  2025-08-14 11:38 ` [RFC PATCH 5/7] mm: memory: improve wp_page_copy readability Vernon Yang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Vernon Yang @ 2025-08-14 11:38 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, glider, elver, dvyukov,
	vbabka, rppt, surenb, mhocko, muchun.song, osalvador, shuah,
	richardcochran
  Cc: linux-mm, linux-kernel, Vernon Yang

In preparation for mTHP support in the wp path, add an offset parameter
to copy_user_large_folio() so the copy can start at a given page offset
within the source folio.

Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
---
 include/linux/mm.h |  1 +
 mm/hugetlb.c       |  6 +++---
 mm/memory.c        | 11 ++++++++---
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80c6673f419e..e178fb1049f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4006,6 +4006,7 @@ enum mf_action_page_type {
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 void folio_zero_user(struct folio *folio, unsigned long addr_hint);
 int copy_user_large_folio(struct folio *dst, struct folio *src,
+			  unsigned int offset,
 			  unsigned long addr_hint,
 			  struct vm_area_struct *vma);
 long copy_folio_from_user(struct folio *dst_folio,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a0d285d20992..91e1ec73f092 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5682,7 +5682,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 					break;
 				}
 				ret = copy_user_large_folio(new_folio, pte_folio,
-							    addr, dst_vma);
+							    0, addr, dst_vma);
 				folio_put(pte_folio);
 				if (ret) {
 					folio_put(new_folio);
@@ -6277,7 +6277,7 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 	if (unlikely(ret))
 		goto out_release_all;
 
-	if (copy_user_large_folio(new_folio, old_folio, vmf->real_address, vma)) {
+	if (copy_user_large_folio(new_folio, old_folio, 0, vmf->real_address, vma)) {
 		ret = VM_FAULT_HWPOISON_LARGE | VM_FAULT_SET_HINDEX(hstate_index(h));
 		goto out_release_all;
 	}
@@ -6992,7 +6992,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 			*foliop = NULL;
 			goto out;
 		}
-		ret = copy_user_large_folio(folio, *foliop, dst_addr, dst_vma);
+		ret = copy_user_large_folio(folio, *foliop, 0, dst_addr, dst_vma);
 		folio_put(*foliop);
 		*foliop = NULL;
 		if (ret) {
diff --git a/mm/memory.c b/mm/memory.c
index 7b8c7d0f9ff4..3451e6e5aabd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7071,6 +7071,7 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
 }
 
 static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
+				   unsigned int offset,
 				   unsigned long addr_hint,
 				   struct vm_area_struct *vma,
 				   unsigned int nr_pages)
@@ -7082,7 +7083,7 @@ static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
 
 	for (i = 0; i < nr_pages; i++) {
 		dst_page = folio_page(dst, i);
-		src_page = folio_page(src, i);
+		src_page = folio_page(src, offset + i);
 
 		cond_resched();
 		if (copy_mc_user_highpage(dst_page, src_page,
@@ -7095,6 +7096,7 @@ static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
 struct copy_subpage_arg {
 	struct folio *dst;
 	struct folio *src;
+	unsigned int offset;
 	struct vm_area_struct *vma;
 };
 
@@ -7102,7 +7104,7 @@ static int copy_subpage(unsigned long addr, int idx, void *arg)
 {
 	struct copy_subpage_arg *copy_arg = arg;
 	struct page *dst = folio_page(copy_arg->dst, idx);
-	struct page *src = folio_page(copy_arg->src, idx);
+	struct page *src = folio_page(copy_arg->src, copy_arg->offset + idx);
 
 	if (copy_mc_user_highpage(dst, src, addr, copy_arg->vma))
 		return -EHWPOISON;
@@ -7110,17 +7112,20 @@ static int copy_subpage(unsigned long addr, int idx, void *arg)
 }
 
 int copy_user_large_folio(struct folio *dst, struct folio *src,
+			  unsigned int offset,
 			  unsigned long addr_hint, struct vm_area_struct *vma)
 {
 	unsigned int nr_pages = folio_nr_pages(dst);
 	struct copy_subpage_arg arg = {
 		.dst = dst,
 		.src = src,
+		.offset = offset,
 		.vma = vma,
 	};
 
 	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
-		return copy_user_gigantic_page(dst, src, addr_hint, vma, nr_pages);
+		return copy_user_gigantic_page(dst, src, offset, addr_hint,
+					       vma, nr_pages);
 
 	return process_huge_page(addr_hint, nr_pages, copy_subpage, &arg);
 }
-- 
2.50.1




* [RFC PATCH 5/7] mm: memory: improve wp_page_copy readability
  2025-08-14 11:38 [RFC PATCH 0/7] add mTHP support for wp Vernon Yang
                   ` (3 preceding siblings ...)
  2025-08-14 11:38 ` [RFC PATCH 4/7] mm: memory: add offset to start copy for copy_user_gigantic_page Vernon Yang
@ 2025-08-14 11:38 ` Vernon Yang
  2025-08-14 11:38 ` [RFC PATCH 6/7] mm: memory: add mTHP support for wp Vernon Yang
  2025-08-14 11:38 ` [RFC PATCH 7/7] selftests: mm: support wp mTHP collapse testing Vernon Yang
  6 siblings, 0 replies; 15+ messages in thread
From: Vernon Yang @ 2025-08-14 11:38 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, glider, elver, dvyukov,
	vbabka, rppt, surenb, mhocko, muchun.song, osalvador, shuah,
	richardcochran
  Cc: linux-mm, linux-kernel, Vernon Yang

In preparation for mTHP support in the wp path, improve wp_page_copy()
readability. No functional change.

Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
---
 mm/memory.c | 148 +++++++++++++++++++++++++++-------------------------
 1 file changed, 77 insertions(+), 71 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3451e6e5aabd..8dd869b0cfc1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3551,16 +3551,18 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	struct mm_struct *mm = vma->vm_mm;
 	struct folio *old_folio = NULL;
 	struct folio *new_folio = NULL;
+	struct page *old_page = vmf->page;
 	pte_t entry;
 	int page_copied = 0;
 	struct mmu_notifier_range range;
 	vm_fault_t ret;
 	bool pfn_is_zero;
+	unsigned long addr;
 
 	delayacct_wpcopy_start();
 
-	if (vmf->page)
-		old_folio = page_folio(vmf->page);
+	if (old_page)
+		old_folio = page_folio(old_page);
 	ret = vmf_anon_prepare(vmf);
 	if (unlikely(ret))
 		goto out;
@@ -3570,10 +3572,12 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	if (!new_folio)
 		goto oom;
 
+	addr = ALIGN_DOWN(vmf->address, PAGE_SIZE);
+
 	if (!pfn_is_zero) {
 		int err;
 
-		err = __wp_page_copy_user(&new_folio->page, vmf->page, vmf);
+		err = __wp_page_copy_user(&new_folio->page, old_page, vmf);
 		if (err) {
 			/*
 			 * COW failed, if the fault was solved by other,
@@ -3589,90 +3593,92 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			delayacct_wpcopy_end();
 			return err == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
 		}
-		kmsan_copy_pages_meta(&new_folio->page, vmf->page, 1);
+		kmsan_copy_pages_meta(&new_folio->page, old_page, 1);
 	}
 
 	__folio_mark_uptodate(new_folio);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
-				vmf->address & PAGE_MASK,
-				(vmf->address & PAGE_MASK) + PAGE_SIZE);
+				addr, addr + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
-	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
-	if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
-		if (old_folio) {
-			if (!folio_test_anon(old_folio)) {
-				sub_mm_counter(mm, mm_counter_file(old_folio), 1);
-				add_mm_counter(mm, MM_ANONPAGES, 1);
-			}
-		} else {
-			ksm_might_unmap_zero_page(mm, vmf->orig_pte);
-			inc_mm_counter(mm, MM_ANONPAGES);
-		}
-		flush_cache_range(vma, vmf->address, vmf->address + PAGE_SIZE);
-		entry = folio_mk_pte(new_folio, vma->vm_page_prot);
-		entry = pte_sw_mkyoung(entry);
-		if (unlikely(unshare)) {
-			if (pte_soft_dirty(vmf->orig_pte))
-				entry = pte_mksoft_dirty(entry);
-			if (pte_uffd_wp(vmf->orig_pte))
-				entry = pte_mkuffd_wp(entry);
-		} else {
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+	if (unlikely(!vmf->pte))
+		goto release;
+	if (unlikely(vmf_pte_changed(vmf))) {
+		update_mmu_tlb(vma, addr, vmf->pte);
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		goto release;
+	}
+
+	if (old_folio) {
+		if (!folio_test_anon(old_folio)) {
+			sub_mm_counter(mm, mm_counter_file(old_folio), 1);
+			add_mm_counter(mm, MM_ANONPAGES, 1);
 		}
+	} else {
+		ksm_might_unmap_zero_page(mm, vmf->orig_pte);
+		inc_mm_counter(mm, MM_ANONPAGES);
+	}
+	flush_cache_range(vma, addr, addr + PAGE_SIZE);
+	entry = folio_mk_pte(new_folio, vma->vm_page_prot);
+	entry = pte_sw_mkyoung(entry);
+	if (unlikely(unshare)) {
+		if (pte_soft_dirty(vmf->orig_pte))
+			entry = pte_mksoft_dirty(entry);
+		if (pte_uffd_wp(vmf->orig_pte))
+			entry = pte_mkuffd_wp(entry);
+	} else {
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	}
 
+	/*
+	 * Clear the pte entry and flush it first, before updating the
+	 * pte with the new entry, to keep TLBs on different CPUs in
+	 * sync. This code used to set the new PTE then flush TLBs, but
+	 * that left a window where the new PTE could be loaded into
+	 * some TLBs while the old PTE remains in others.
+	 */
+	ptep_clear_flush_range(vma, addr, vmf->pte, 1);
+	folio_add_new_anon_rmap(new_folio, vma, addr, RMAP_EXCLUSIVE);
+	folio_add_lru_vma(new_folio, vma);
+	BUG_ON(unshare && pte_write(entry));
+	set_ptes(mm, addr, vmf->pte, entry, 1);
+	update_mmu_cache_range(vmf, vma, addr, vmf->pte, 1);
+	if (old_folio) {
 		/*
-		 * Clear the pte entry and flush it first, before updating the
-		 * pte with the new entry, to keep TLBs on different CPUs in
-		 * sync. This code used to set the new PTE then flush TLBs, but
-		 * that left a window where the new PTE could be loaded into
-		 * some TLBs while the old PTE remains in others.
+		 * Only after switching the pte to the new page may
+		 * we remove the mapcount here. Otherwise another
+		 * process may come and find the rmap count decremented
+		 * before the pte is switched to the new page, and
+		 * "reuse" the old page writing into it while our pte
+		 * here still points into it and can be read by other
+		 * threads.
+		 *
+		 * The critical issue is to order this
+		 * folio_remove_rmap_pte() with the ptp_clear_flush
+		 * above. Those stores are ordered by (if nothing else,)
+		 * the barrier present in the atomic_add_negative
+		 * in folio_remove_rmap_pte();
+		 *
+		 * Then the TLB flush in ptep_clear_flush ensures that
+		 * no process can access the old page before the
+		 * decremented mapcount is visible. And the old page
+		 * cannot be reused until after the decremented
+		 * mapcount is visible. So transitively, TLBs to
+		 * old page will be flushed before it can be reused.
 		 */
-		ptep_clear_flush_range(vma, vmf->address, vmf->pte, 1);
-		folio_add_new_anon_rmap(new_folio, vma, vmf->address, RMAP_EXCLUSIVE);
-		folio_add_lru_vma(new_folio, vma);
-		BUG_ON(unshare && pte_write(entry));
-		set_ptes(mm, vmf->address, vmf->pte, entry, 1);
-		update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
-		if (old_folio) {
-			/*
-			 * Only after switching the pte to the new page may
-			 * we remove the mapcount here. Otherwise another
-			 * process may come and find the rmap count decremented
-			 * before the pte is switched to the new page, and
-			 * "reuse" the old page writing into it while our pte
-			 * here still points into it and can be read by other
-			 * threads.
-			 *
-			 * The critical issue is to order this
-			 * folio_remove_rmap_pte() with the ptp_clear_flush
-			 * above. Those stores are ordered by (if nothing else,)
-			 * the barrier present in the atomic_add_negative
-			 * in folio_remove_rmap_pte();
-			 *
-			 * Then the TLB flush in ptep_clear_flush ensures that
-			 * no process can access the old page before the
-			 * decremented mapcount is visible. And the old page
-			 * cannot be reused until after the decremented
-			 * mapcount is visible. So transitively, TLBs to
-			 * old page will be flushed before it can be reused.
-			 */
-			folio_remove_rmap_ptes(old_folio, vmf->page, 1, vma);
-		}
-
-		/* Free the old page.. */
-		new_folio = old_folio;
-		page_copied = 1;
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-	} else if (vmf->pte) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		folio_remove_rmap_ptes(old_folio, old_page, 1, vma);
 	}
 
+	/* Free the old page.. */
+	new_folio = old_folio;
+	page_copied = 1;
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+release:
 	mmu_notifier_invalidate_range_end(&range);
 
 	if (new_folio)
-- 
2.50.1




* [RFC PATCH 6/7] mm: memory: add mTHP support for wp
  2025-08-14 11:38 [RFC PATCH 0/7] add mTHP support for wp Vernon Yang
                   ` (4 preceding siblings ...)
  2025-08-14 11:38 ` [RFC PATCH 5/7] mm: memory: improve wp_page_copy readability Vernon Yang
@ 2025-08-14 11:38 ` Vernon Yang
  2025-08-14 11:58   ` David Hildenbrand
  2025-08-14 12:57   ` David Hildenbrand
  2025-08-14 11:38 ` [RFC PATCH 7/7] selftests: mm: support wp mTHP collapse testing Vernon Yang
  6 siblings, 2 replies; 15+ messages in thread
From: Vernon Yang @ 2025-08-14 11:38 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, glider, elver, dvyukov,
	vbabka, rppt, surenb, mhocko, muchun.song, osalvador, shuah,
	richardcochran
  Cc: linux-mm, linux-kernel, Vernon Yang

Anonymous page faults already support mTHP, so hardware features such as
arm64 contpte can map multiple PTEs with one TLB entry, reducing the
probability of TLB misses. However, once the process forks and CoW is
triggered again, this optimization is lost, because the CoW path only
allocates 4KB at a time.

Therefore, make the pagefault write-protect copy path support mTHP, so
the TLB optimization is preserved and CoW page faults become more
efficient.

vm-scalability usemem shows a clear improvement.
Test command: usemem -n 32 --prealloc --prefault 249062617
(results are in KB/s; higher is better)

|    size     | w/o patch | w/ patch  |  delta  |
|-------------|-----------|-----------|---------|
| baseline 4K | 723041.63 | 717643.21 | -0.75%  |
| mthp 16K    | 732871.14 | 799513.18 | +9.09%  |
| mthp 32K    | 746060.91 | 836261.83 | +12.09% |
| mthp 64K    | 747333.18 | 855570.43 | +14.48% |

Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
---
 include/linux/huge_mm.h |   3 +
 mm/memory.c             | 174 ++++++++++++++++++++++++++++++++++++----
 2 files changed, 163 insertions(+), 14 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..d1ebbe0636fb 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -132,6 +132,9 @@ enum mthp_stat_item {
 	MTHP_STAT_SHMEM_ALLOC,
 	MTHP_STAT_SHMEM_FALLBACK,
 	MTHP_STAT_SHMEM_FALLBACK_CHARGE,
+	MTHP_STAT_WP_FAULT_ALLOC,
+	MTHP_STAT_WP_FAULT_FALLBACK,
+	MTHP_STAT_WP_FAULT_FALLBACK_CHARGE,
 	MTHP_STAT_SPLIT,
 	MTHP_STAT_SPLIT_FAILED,
 	MTHP_STAT_SPLIT_DEFERRED,
diff --git a/mm/memory.c b/mm/memory.c
index 8dd869b0cfc1..ea84c49cc975 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3344,6 +3344,21 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
 	return ret;
 }
 
+static inline int __wp_folio_copy_user(struct folio *dst, struct folio *src,
+				       unsigned int offset,
+				       struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	void __user *uaddr;
+
+	if (likely(src))
+		return copy_user_large_folio(dst, src, offset, vmf->address, vma);
+
+	uaddr = (void __user *)ALIGN_DOWN(vmf->address, folio_size(dst));
+
+	return copy_folio_from_user(dst, uaddr, 0);
+}
+
 static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
 {
 	struct file *vm_file = vma->vm_file;
@@ -3527,6 +3542,119 @@ vm_fault_t __vmf_anon_prepare(struct vm_fault *vmf)
 	return ret;
 }
 
+static inline unsigned long thp_wp_suitable_orders(struct folio *old_folio,
+						   unsigned long orders)
+{
+	int order, max_order;
+
+	max_order = folio_order(old_folio);
+	order = highest_order(orders);
+
+	/*
+	 * Since need to copy content from the old folio to the new folio, the
+	 * maximum size of the new folio will not exceed the old folio size,
+	 * so filter the inappropriate order.
+	 */
+	while (orders) {
+		if (order <= max_order)
+			break;
+		order = next_order(&orders, order);
+	}
+
+	return orders;
+}
+
+static bool pte_range_readonly(pte_t *pte, int nr_pages)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++) {
+		if (pte_write(ptep_get_lockless(pte + i)))
+			return false;
+	}
+
+	return true;
+}
+
+static struct folio *alloc_wp_folio(struct vm_fault *vmf, bool pfn_is_zero)
+{
+	struct vm_area_struct *vma = vmf->vma;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long orders;
+	struct folio *folio;
+	unsigned long addr;
+	pte_t *pte;
+	gfp_t gfp;
+	int order;
+
+	/*
+	 * If uffd is active for the vma we need per-page fault fidelity to
+	 * maintain the uffd semantics.
+	 */
+	if (unlikely(userfaultfd_armed(vma)))
+		goto fallback;
+
+	if (pfn_is_zero || !vmf->page)
+		goto fallback;
+
+	/*
+	 * Get a list of all the (large) orders below folio_order() that are enabled
+	 * for this vma. Then filter out the orders that can't be allocated over
+	 * the faulting address and still be fully contained in the vma.
+	 */
+	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+	orders = thp_wp_suitable_orders(page_folio(vmf->page), orders);
+
+	if (!orders)
+		goto fallback;
+
+	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+	if (!pte)
+		return ERR_PTR(-EAGAIN);
+
+	/*
+	 * Find the highest order where the aligned range is completely readonly.
+	 * Note that all remaining orders will be completely readonly.
+	 */
+	order = highest_order(orders);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		if (pte_range_readonly(pte + pte_index(addr), 1 << order))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	pte_unmap(pte);
+
+	if (!orders)
+		goto fallback;
+
+	/* Try allocating the highest of the remaining orders. */
+	gfp = vma_thp_gfp_mask(vma);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		folio = vma_alloc_folio(gfp, order, vma, addr);
+		if (folio) {
+			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
+				count_mthp_stat(order, MTHP_STAT_WP_FAULT_FALLBACK_CHARGE);
+				folio_put(folio);
+				goto next;
+			}
+			folio_throttle_swaprate(folio, gfp);
+			return folio;
+		}
+next:
+		count_mthp_stat(order, MTHP_STAT_WP_FAULT_FALLBACK);
+		order = next_order(&orders, order);
+	}
+
+fallback:
+#endif
+	return folio_prealloc(vma->vm_mm, vma, vmf->address, pfn_is_zero);
+}
+
 /*
  * Handle the case of a page which we actually need to copy to a new page,
  * either due to COW or unsharing.
@@ -3558,6 +3686,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	vm_fault_t ret;
 	bool pfn_is_zero;
 	unsigned long addr;
+	int nr_pages;
 
 	delayacct_wpcopy_start();
 
@@ -3568,16 +3697,26 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		goto out;
 
 	pfn_is_zero = is_zero_pfn(pte_pfn(vmf->orig_pte));
-	new_folio = folio_prealloc(mm, vma, vmf->address, pfn_is_zero);
+	/* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
+	new_folio = alloc_wp_folio(vmf, pfn_is_zero);
+	if (IS_ERR(new_folio))
+		return 0;
 	if (!new_folio)
 		goto oom;
 
-	addr = ALIGN_DOWN(vmf->address, PAGE_SIZE);
+	nr_pages = folio_nr_pages(new_folio);
+	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+	old_page -= (vmf->address - addr) >> PAGE_SHIFT;
 
 	if (!pfn_is_zero) {
 		int err;
 
-		err = __wp_page_copy_user(&new_folio->page, old_page, vmf);
+		if (nr_pages == 1)
+			err = __wp_page_copy_user(&new_folio->page, old_page, vmf);
+		else
+			err = __wp_folio_copy_user(new_folio, old_folio,
+					folio_page_idx(old_folio, old_page), vmf);
+
 		if (err) {
 			/*
 			 * COW failed, if the fault was solved by other,
@@ -3593,13 +3732,13 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			delayacct_wpcopy_end();
 			return err == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
 		}
-		kmsan_copy_pages_meta(&new_folio->page, old_page, 1);
+		kmsan_copy_pages_meta(&new_folio->page, old_page, nr_pages);
 	}
 
 	__folio_mark_uptodate(new_folio);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
-				addr, addr + PAGE_SIZE);
+				addr, addr + nr_pages * PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 
 	/*
@@ -3608,22 +3747,26 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
 	if (unlikely(!vmf->pte))
 		goto release;
-	if (unlikely(vmf_pte_changed(vmf))) {
+	if (unlikely(nr_pages == 1 && vmf_pte_changed(vmf))) {
 		update_mmu_tlb(vma, addr, vmf->pte);
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		goto release;
+	} else if (nr_pages > 1 && !pte_range_readonly(vmf->pte, nr_pages)) {
+		update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		goto release;
 	}
 
 	if (old_folio) {
 		if (!folio_test_anon(old_folio)) {
-			sub_mm_counter(mm, mm_counter_file(old_folio), 1);
-			add_mm_counter(mm, MM_ANONPAGES, 1);
+			sub_mm_counter(mm, mm_counter_file(old_folio), nr_pages);
+			add_mm_counter(mm, MM_ANONPAGES, nr_pages);
 		}
 	} else {
 		ksm_might_unmap_zero_page(mm, vmf->orig_pte);
 		inc_mm_counter(mm, MM_ANONPAGES);
 	}
-	flush_cache_range(vma, addr, addr + PAGE_SIZE);
+	flush_cache_range(vma, addr, addr + nr_pages * PAGE_SIZE);
 	entry = folio_mk_pte(new_folio, vma->vm_page_prot);
 	entry = pte_sw_mkyoung(entry);
 	if (unlikely(unshare)) {
@@ -3642,12 +3785,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	 * that left a window where the new PTE could be loaded into
 	 * some TLBs while the old PTE remains in others.
 	 */
-	ptep_clear_flush_range(vma, addr, vmf->pte, 1);
+	ptep_clear_flush_range(vma, addr, vmf->pte, nr_pages);
+	folio_ref_add(new_folio, nr_pages - 1);
+	count_mthp_stat(folio_order(new_folio), MTHP_STAT_WP_FAULT_ALLOC);
 	folio_add_new_anon_rmap(new_folio, vma, addr, RMAP_EXCLUSIVE);
 	folio_add_lru_vma(new_folio, vma);
 	BUG_ON(unshare && pte_write(entry));
-	set_ptes(mm, addr, vmf->pte, entry, 1);
-	update_mmu_cache_range(vmf, vma, addr, vmf->pte, 1);
+	set_ptes(mm, addr, vmf->pte, entry, nr_pages);
+	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
 	if (old_folio) {
 		/*
 		 * Only after switching the pte to the new page may
@@ -3671,7 +3816,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * mapcount is visible. So transitively, TLBs to
 		 * old page will be flushed before it can be reused.
 		 */
-		folio_remove_rmap_ptes(old_folio, old_page, 1, vma);
+		folio_remove_rmap_ptes(old_folio, old_page, nr_pages, vma);
 	}
 
 	/* Free the old page.. */
@@ -3682,7 +3827,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	mmu_notifier_invalidate_range_end(&range);
 
 	if (new_folio)
-		folio_put_refs(new_folio, 1);
+		folio_put_refs(new_folio, page_copied ? nr_pages : 1);
+
 	if (old_folio) {
 		if (page_copied)
 			free_swap_cache(old_folio);
-- 
2.50.1




* [RFC PATCH 7/7] selftests: mm: support wp mTHP collapse testing
  2025-08-14 11:38 [RFC PATCH 0/7] add mTHP support for wp Vernon Yang
                   ` (5 preceding siblings ...)
  2025-08-14 11:38 ` [RFC PATCH 6/7] mm: memory: add mTHP support for wp Vernon Yang
@ 2025-08-14 11:38 ` Vernon Yang
  6 siblings, 0 replies; 15+ messages in thread
From: Vernon Yang @ 2025-08-14 11:38 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, glider, elver, dvyukov,
	vbabka, rppt, surenb, mhocko, muchun.song, osalvador, shuah,
	richardcochran
  Cc: linux-mm, linux-kernel, Vernon Yang

Add wp mTHP collapse testing. As with anonymous pages, users can use
the '-s' parameter to specify the wp mTHP size to test.

Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
---
 tools/testing/selftests/mm/khugepaged.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 8a4d34cce36b..143c4ad9f6a1 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -981,6 +981,7 @@ static void collapse_fork_compound(struct collapse_context *c, struct mem_ops *o
 static void collapse_max_ptes_shared(struct collapse_context *c, struct mem_ops *ops)
 {
 	int max_ptes_shared = thp_read_num("khugepaged/max_ptes_shared");
+	int fault_nr_pages = is_anon(ops) ? 1 << anon_order : 1;
 	int wstatus;
 	void *p;
 
@@ -997,8 +998,8 @@ static void collapse_max_ptes_shared(struct collapse_context *c, struct mem_ops
 			fail("Fail");
 
 		printf("Trigger CoW on page %d of %d...",
-				hpage_pmd_nr - max_ptes_shared - 1, hpage_pmd_nr);
-		ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size);
+				hpage_pmd_nr - max_ptes_shared - fault_nr_pages, hpage_pmd_nr);
+		ops->fault(p, 0, (hpage_pmd_nr - max_ptes_shared - fault_nr_pages) * page_size);
 		if (ops->check_huge(p, 0))
 			success("OK");
 		else
-- 
2.50.1




* Re: [RFC PATCH 6/7] mm: memory: add mTHP support for wp
  2025-08-14 11:38 ` [RFC PATCH 6/7] mm: memory: add mTHP support for wp Vernon Yang
@ 2025-08-14 11:58   ` David Hildenbrand
  2025-08-15 15:20     ` Vernon Yang
  2025-08-14 12:57   ` David Hildenbrand
  1 sibling, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2025-08-14 11:58 UTC (permalink / raw)
  To: Vernon Yang, akpm, lorenzo.stoakes, ziy, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, glider,
	elver, dvyukov, vbabka, rppt, surenb, mhocko, muchun.song,
	osalvador, shuah, richardcochran
  Cc: linux-mm, linux-kernel

On 14.08.25 13:38, Vernon Yang wrote:
> Currently pagefaults on anonymous pages support mthp, and hardware
> features (such as arm64 contpte) can be used to store multiple ptes in
> one TLB entry, reducing the probability of TLB misses. However, when the
> process is forked and the cow is triggered again, the above optimization
> effect is lost, and only 4KB is requested once at a time.
> 
> Therefore, make pagefault write-protect copy support mthp to maintain the
> optimization effect of TLB and improve the efficiency of cow pagefault.
> 
> vm-scalability usemem shows a great improvement,
> test using: usemem -n 32 --prealloc --prefault 249062617
> (result unit is KB/s, bigger is better)
> 
> |    size     | w/o patch | w/ patch  |  delta  |
> |-------------|-----------|-----------|---------|
> | baseline 4K | 723041.63 | 717643.21 | -0.75%  |
> | mthp 16K    | 732871.14 | 799513.18 | +9.09%  |
> | mthp 32K    | 746060.91 | 836261.83 | +12.09% |
> | mthp 64K    | 747333.18 | 855570.43 | +14.48% |

You're missing two of the most important metrics: COW latency and memory 
waste.

Just imagine what happens if you have PMD-sized THP.

I would suggest you explore why Redis used to recommend to disable THPs 
(hint: tail latency due to COW of way-too-large chunks before we do what 
we do today).

So staring at usemem micro-benchmark results is a bit misleading.

As discussed in the past, I would actually suggest to

a) Let khugepaged deal with fixing this up later, keeping CoW path
    simpler and faster.
b) If we really really have to do this during fault time, limit it to
    some order (might even have to be configurable).

I really think we should keep CoW latency low and instead let khugepaged 
fix that up later. (Nico is working on mTHP collapse support)

[are you handling having a mixture of PageAnonExclusive within a folio 
properly? Only staring at R/O PTEs is usually insufficient to determine 
whether you can COW or whether you must reuse].

-- 
Cheers

David / dhildenb




* Re: [RFC PATCH 6/7] mm: memory: add mTHP support for wp
  2025-08-14 11:38 ` [RFC PATCH 6/7] mm: memory: add mTHP support for wp Vernon Yang
  2025-08-14 11:58   ` David Hildenbrand
@ 2025-08-14 12:57   ` David Hildenbrand
  2025-08-15 15:30     ` Vernon Yang
  1 sibling, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2025-08-14 12:57 UTC (permalink / raw)
  To: Vernon Yang, akpm, lorenzo.stoakes, ziy, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, glider,
	elver, dvyukov, vbabka, rppt, surenb, mhocko, muchun.song,
	osalvador, shuah, richardcochran
  Cc: linux-mm, linux-kernel

On 14.08.25 13:38, Vernon Yang wrote:
> Currently pagefaults on anonymous pages support mthp, and hardware
> features (such as arm64 contpte) can be used to store multiple ptes in
> one TLB entry, reducing the probability of TLB misses. However, when the
> process is forked and the cow is triggered again, the above optimization
> effect is lost, and only 4KB is requested once at a time.
> 
> Therefore, make pagefault write-protect copy support mthp to maintain the
> optimization effect of TLB and improve the efficiency of cow pagefault.
> 
> vm-scalability usemem shows a great improvement,
> test using: usemem -n 32 --prealloc --prefault 249062617
> (result unit is KB/s, bigger is better)
> 
> |    size     | w/o patch | w/ patch  |  delta  |
> |-------------|-----------|-----------|---------|
> | baseline 4K | 723041.63 | 717643.21 | -0.75%  |
> | mthp 16K    | 732871.14 | 799513.18 | +9.09%  |
> | mthp 32K    | 746060.91 | 836261.83 | +12.09% |
> | mthp 64K    | 747333.18 | 855570.43 | +14.48% |
> 
> Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
> ---
>   include/linux/huge_mm.h |   3 +
>   mm/memory.c             | 174 ++++++++++++++++++++++++++++++++++++----
>   2 files changed, 163 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..d1ebbe0636fb 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -132,6 +132,9 @@ enum mthp_stat_item {
>   	MTHP_STAT_SHMEM_ALLOC,
>   	MTHP_STAT_SHMEM_FALLBACK,
>   	MTHP_STAT_SHMEM_FALLBACK_CHARGE,
> +	MTHP_STAT_WP_FAULT_ALLOC,
> +	MTHP_STAT_WP_FAULT_FALLBACK,
> +	MTHP_STAT_WP_FAULT_FALLBACK_CHARGE,
>   	MTHP_STAT_SPLIT,
>   	MTHP_STAT_SPLIT_FAILED,
>   	MTHP_STAT_SPLIT_DEFERRED,
> diff --git a/mm/memory.c b/mm/memory.c
> index 8dd869b0cfc1..ea84c49cc975 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3344,6 +3344,21 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
>   	return ret;
>   }
>   
> +static inline int __wp_folio_copy_user(struct folio *dst, struct folio *src,
> +				       unsigned int offset,
> +				       struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	void __user *uaddr;
> +
> +	if (likely(src))
> +		return copy_user_large_folio(dst, src, offset, vmf->address, vma);
> +
> +	uaddr = (void __user *)ALIGN_DOWN(vmf->address, folio_size(dst));
> +
> +	return copy_folio_from_user(dst, uaddr, 0);
> +}
> +
>   static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
>   {
>   	struct file *vm_file = vma->vm_file;
> @@ -3527,6 +3542,119 @@ vm_fault_t __vmf_anon_prepare(struct vm_fault *vmf)
>   	return ret;
>   }
>   
> +static inline unsigned long thp_wp_suitable_orders(struct folio *old_folio,
> +						   unsigned long orders)
> +{
> +	int order, max_order;
> +
> +	max_order = folio_order(old_folio);
> +	order = highest_order(orders);
> +
> +	/*
> +	 * Since need to copy content from the old folio to the new folio, the
> +	 * maximum size of the new folio will not exceed the old folio size,
> +	 * so filter the inappropriate order.
> +	 */
> +	while (orders) {
> +		if (order <= max_order)
> +			break;
> +		order = next_order(&orders, order);
> +	}
> +
> +	return orders;
> +}
> +
> +static bool pte_range_readonly(pte_t *pte, int nr_pages)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		if (pte_write(ptep_get_lockless(pte + i)))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static struct folio *alloc_wp_folio(struct vm_fault *vmf, bool pfn_is_zero)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	unsigned long orders;
> +	struct folio *folio;
> +	unsigned long addr;
> +	pte_t *pte;
> +	gfp_t gfp;
> +	int order;
> +
> +	/*
> +	 * If uffd is active for the vma we need per-page fault fidelity to
> +	 * maintain the uffd semantics.
> +	 */
> +	if (unlikely(userfaultfd_armed(vma)))
> +		goto fallback;
> +
> +	if (pfn_is_zero || !vmf->page)
> +		goto fallback;
> +
> +	/*
> +	 * Get a list of all the (large) orders below folio_order() that are enabled
> +	 * for this vma. Then filter out the orders that can't be allocated over
> +	 * the faulting address and still be fully contained in the vma.
> +	 */
> +	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> +			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> +	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> +	orders = thp_wp_suitable_orders(page_folio(vmf->page), orders);
> +
> +	if (!orders)
> +		goto fallback;
> +
> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> +	if (!pte)
> +		return ERR_PTR(-EAGAIN);
> +
> +	/*
> +	 * Find the highest order where the aligned range is completely readonly.
> +	 * Note that all remaining orders will be completely readonly.
> +	 */
> +	order = highest_order(orders);
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		if (pte_range_readonly(pte + pte_index(addr), 1 << order))
> +			break;
> +		order = next_order(&orders, order);
> +	}
> +
> +	pte_unmap(pte);
> +
> +	if (!orders)
> +		goto fallback;
> +
> +	/* Try allocating the highest of the remaining orders. */
> +	gfp = vma_thp_gfp_mask(vma);
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		folio = vma_alloc_folio(gfp, order, vma, addr);
> +		if (folio) {
> +			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
> +				count_mthp_stat(order, MTHP_STAT_WP_FAULT_FALLBACK_CHARGE);
> +				folio_put(folio);
> +				goto next;
> +			}
> +			folio_throttle_swaprate(folio, gfp);
> +			return folio;
> +		}

I might be missing something, but besides the PAE issue I think there 
are more issues lurking here:

* Are you scanning outside of the current VMA, and some PTEs might
   actually belong to a !writable VMA?
* Are you assuming that the R/O PTE range is actually mapping all-pages
   from the same large folio?

I am not sure if you are assuming some natural alignment of the old 
folio. Due to mremap() that need not be the case.

Which stresses my point: khugepaged might be the better place to 
re-collapse where reasonable, avoiding further complexity in our CoW 
handling.

-- 
Cheers

David / dhildenb




* Re: [RFC PATCH 6/7] mm: memory: add mTHP support for wp
  2025-08-14 11:58   ` David Hildenbrand
@ 2025-08-15 15:20     ` Vernon Yang
  2025-08-16  6:40       ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Vernon Yang @ 2025-08-15 15:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, glider, elver, dvyukov, vbabka,
	rppt, surenb, mhocko, muchun.song, osalvador, shuah,
	richardcochran, linux-mm, linux-kernel

On Thu, Aug 14, 2025 at 01:58:34PM +0200, David Hildenbrand wrote:
> On 14.08.25 13:38, Vernon Yang wrote:
> > Currently pagefaults on anonymous pages support mthp, and hardware
> > features (such as arm64 contpte) can be used to store multiple ptes in
> > one TLB entry, reducing the probability of TLB misses. However, when the
> > process is forked and the cow is triggered again, the above optimization
> > effect is lost, and only 4KB is requested once at a time.
> >
> > Therefore, make pagefault write-protect copy support mthp to maintain the
> > optimization effect of TLB and improve the efficiency of cow pagefault.
> >
> > vm-scalability usemem shows a great improvement,
> > test using: usemem -n 32 --prealloc --prefault 249062617
> > (result unit is KB/s, bigger is better)
> >
> > |    size     | w/o patch | w/ patch  |  delta  |
> > |-------------|-----------|-----------|---------|
> > | baseline 4K | 723041.63 | 717643.21 | -0.75%  |
> > | mthp 16K    | 732871.14 | 799513.18 | +9.09%  |
> > | mthp 32K    | 746060.91 | 836261.83 | +12.09% |
> > | mthp 64K    | 747333.18 | 855570.43 | +14.48% |
>
> You're missing two of the most important metrics: COW latency and memory
> waste.

OK, I will add the above two tests later.

>
> Just imagine what happens if you have PMD-sized THP.
>
> I would suggest you explore why Redis used to recommend to disable THPs
> (hint: tail latency due to COW of way-too-large chunks before we do what we
> do today).

Thanks for the suggestion; indeed, I'm not very familiar with Redis.
Currently this series supports small granularities, such as 16KB, and I
will also run redis-benchmark later to see how severe the tail latency is.

>
> So staring at usemem micro-benchmark results is a bit misleading.
>
> As discussed in the past, I would actually suggest to
>
> a) Let khugepaged deal with fixing this up later, keeping CoW path
>    simpler and faster.
> b) If we really really have to do this during fault time, limit it to
>    some order (might even be have to be configurable).

Adding a knob similar to shmem_enabled later, if needed, sounds like a
good approach.

>
> I really think we should keep CoW latency low and instead let khugepaged fix
> that up later. (Nico is working on mTHP collapse support)
>
> [are you handling having a mixture of PageAnonExclusive within a folio
> properly? Only staring at R/O PTEs is usually insufficient to determine
> whether you can COW or whether you must reuse].

There is no extra handling of PageAnonExclusive here; the decision is based
only on R/O PTEs. Thank you for pointing it out; I will look into how to
handle this situation properly later.

>
> --
> Cheers
>
> David / dhildenb
>



* Re: [RFC PATCH 6/7] mm: memory: add mTHP support for wp
  2025-08-14 12:57   ` David Hildenbrand
@ 2025-08-15 15:30     ` Vernon Yang
  0 siblings, 0 replies; 15+ messages in thread
From: Vernon Yang @ 2025-08-15 15:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, glider, elver, dvyukov, vbabka,
	rppt, surenb, mhocko, muchun.song, osalvador, shuah,
	richardcochran, linux-mm, linux-kernel

On Thu, Aug 14, 2025 at 02:57:34PM +0200, David Hildenbrand wrote:
> On 14.08.25 13:38, Vernon Yang wrote:
> > Currently pagefaults on anonymous pages support mthp, and hardware
> > features (such as arm64 contpte) can be used to store multiple ptes in
> > one TLB entry, reducing the probability of TLB misses. However, when the
> > process is forked and the cow is triggered again, the above optimization
> > effect is lost, and only 4KB is requested once at a time.
> >
> > Therefore, make pagefault write-protect copy support mthp to maintain the
> > optimization effect of TLB and improve the efficiency of cow pagefault.
> >
> > vm-scalability usemem shows a great improvement,
> > test using: usemem -n 32 --prealloc --prefault 249062617
> > (result unit is KB/s, bigger is better)
> >
> > |    size     | w/o patch | w/ patch  |  delta  |
> > |-------------|-----------|-----------|---------|
> > | baseline 4K | 723041.63 | 717643.21 | -0.75%  |
> > | mthp 16K    | 732871.14 | 799513.18 | +9.09%  |
> > | mthp 32K    | 746060.91 | 836261.83 | +12.09% |
> > | mthp 64K    | 747333.18 | 855570.43 | +14.48% |
> >
> > Signed-off-by: Vernon Yang <vernon2gm@gmail.com>
> > ---
> >   include/linux/huge_mm.h |   3 +
> >   mm/memory.c             | 174 ++++++++++++++++++++++++++++++++++++----
> >   2 files changed, 163 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 2f190c90192d..d1ebbe0636fb 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -132,6 +132,9 @@ enum mthp_stat_item {
> >   	MTHP_STAT_SHMEM_ALLOC,
> >   	MTHP_STAT_SHMEM_FALLBACK,
> >   	MTHP_STAT_SHMEM_FALLBACK_CHARGE,
> > +	MTHP_STAT_WP_FAULT_ALLOC,
> > +	MTHP_STAT_WP_FAULT_FALLBACK,
> > +	MTHP_STAT_WP_FAULT_FALLBACK_CHARGE,
> >   	MTHP_STAT_SPLIT,
> >   	MTHP_STAT_SPLIT_FAILED,
> >   	MTHP_STAT_SPLIT_DEFERRED,
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8dd869b0cfc1..ea84c49cc975 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3344,6 +3344,21 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
> >   	return ret;
> >   }
> > +static inline int __wp_folio_copy_user(struct folio *dst, struct folio *src,
> > +				       unsigned int offset,
> > +				       struct vm_fault *vmf)
> > +{
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	void __user *uaddr;
> > +
> > +	if (likely(src))
> > +		return copy_user_large_folio(dst, src, offset, vmf->address, vma);
> > +
> > +	uaddr = (void __user *)ALIGN_DOWN(vmf->address, folio_size(dst));
> > +
> > +	return copy_folio_from_user(dst, uaddr, 0);
> > +}
> > +
> >   static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
> >   {
> >   	struct file *vm_file = vma->vm_file;
> > @@ -3527,6 +3542,119 @@ vm_fault_t __vmf_anon_prepare(struct vm_fault *vmf)
> >   	return ret;
> >   }
> > +static inline unsigned long thp_wp_suitable_orders(struct folio *old_folio,
> > +						   unsigned long orders)
> > +{
> > +	int order, max_order;
> > +
> > +	max_order = folio_order(old_folio);
> > +	order = highest_order(orders);
> > +
> > +	/*
> > +	 * Since need to copy content from the old folio to the new folio, the
> > +	 * maximum size of the new folio will not exceed the old folio size,
> > +	 * so filter the inappropriate order.
> > +	 */
> > +	while (orders) {
> > +		if (order <= max_order)
> > +			break;
> > +		order = next_order(&orders, order);
> > +	}
> > +
> > +	return orders;
> > +}
> > +
> > +static bool pte_range_readonly(pte_t *pte, int nr_pages)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < nr_pages; i++) {
> > +		if (pte_write(ptep_get_lockless(pte + i)))
> > +			return false;
> > +	}
> > +
> > +	return true;
> > +}
> > +
> > +static struct folio *alloc_wp_folio(struct vm_fault *vmf, bool pfn_is_zero)
> > +{
> > +	struct vm_area_struct *vma = vmf->vma;
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +	unsigned long orders;
> > +	struct folio *folio;
> > +	unsigned long addr;
> > +	pte_t *pte;
> > +	gfp_t gfp;
> > +	int order;
> > +
> > +	/*
> > +	 * If uffd is active for the vma we need per-page fault fidelity to
> > +	 * maintain the uffd semantics.
> > +	 */
> > +	if (unlikely(userfaultfd_armed(vma)))
> > +		goto fallback;
> > +
> > +	if (pfn_is_zero || !vmf->page)
> > +		goto fallback;
> > +
> > +	/*
> > +	 * Get a list of all the (large) orders below folio_order() that are enabled
> > +	 * for this vma. Then filter out the orders that can't be allocated over
> > +	 * the faulting address and still be fully contained in the vma.
> > +	 */
> > +	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > +			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> > +	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > +	orders = thp_wp_suitable_orders(page_folio(vmf->page), orders);
> > +
> > +	if (!orders)
> > +		goto fallback;
> > +
> > +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> > +	if (!pte)
> > +		return ERR_PTR(-EAGAIN);
> > +
> > +	/*
> > +	 * Find the highest order where the aligned range is completely readonly.
> > +	 * Note that all remaining orders will be completely readonly.
> > +	 */
> > +	order = highest_order(orders);
> > +	while (orders) {
> > +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > +		if (pte_range_readonly(pte + pte_index(addr), 1 << order))
> > +			break;
> > +		order = next_order(&orders, order);
> > +	}
> > +
> > +	pte_unmap(pte);
> > +
> > +	if (!orders)
> > +		goto fallback;
> > +
> > +	/* Try allocating the highest of the remaining orders. */
> > +	gfp = vma_thp_gfp_mask(vma);
> > +	while (orders) {
> > +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > +		folio = vma_alloc_folio(gfp, order, vma, addr);
> > +		if (folio) {
> > +			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
> > +				count_mthp_stat(order, MTHP_STAT_WP_FAULT_FALLBACK_CHARGE);
> > +				folio_put(folio);
> > +				goto next;
> > +			}
> > +			folio_throttle_swaprate(folio, gfp);
> > +			return folio;
> > +		}
>
> I might be missing something, but besides the PAE issue I think there are
> more issues lurking here:
>
> * Are you scanning outside of the current VMA, and some PTEs might
>   actually belong to a !writable VMA?

In thp_vma_suitable_order(), the candidate range cannot exceed the
current VMA, so all of the PTEs scanned here belong to the current,
writable VMA.
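
For illustration, the containment argument boils down to a check of
this shape. This is only a paraphrase of the idea, not the exact body
of thp_vma_suitable_order(), and wp_order_fits_vma() is just a name I
made up here:

static bool wp_order_fits_vma(struct vm_area_struct *vma,
                              unsigned long addr, int order)
{
        unsigned long size = PAGE_SIZE << order;
        unsigned long haddr = ALIGN_DOWN(addr, size);

        /* Reject any order whose naturally aligned range leaves the VMA. */
        return haddr >= vma->vm_start && haddr + size <= vma->vm_end;
}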

> * Are you assuming that the R/O PTE range is actually mapping all-pages
>   from the same large folio?

Yes. Is there a potential problem with that assumption? Maybe I'm
missing something.
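
If it is a problem, the extra validation I could add would look
roughly like the sketch below. pte_range_maps_folio() is a made-up
name, a real version would also want the pages to be consecutive
inside the folio, and it would have to run under the PTE lock:

static bool pte_range_maps_folio(struct vm_area_struct *vma,
                                 unsigned long addr, pte_t *ptep,
                                 struct folio *folio, int nr_pages)
{
        int i;

        for (i = 0; i < nr_pages; i++, addr += PAGE_SIZE) {
                pte_t pte = ptep_get(ptep + i);
                struct page *page;

                if (!pte_present(pte) || pte_write(pte))
                        return false;

                /* Every page in the range must come from the old folio. */
                page = vm_normal_page(vma, addr, pte);
                if (!page || page_folio(page) != folio)
                        return false;
        }

        return true;
}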

>
> I am not sure if you are assuming some natural alignment of the old folio.
> Due to mremap() that must not be the case.

Here it is assumed that the virtual address is aligned to the old
folio size; mremap() would indeed break that assumption, right?
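
If so, the alignment assumption could be replaced by an explicit check
along the lines of the sketch below (the name and helper are invented
here, not from the series): derive where page 0 of the old folio is
mapped from the faulting page's position inside the folio, and only
proceed when the candidate aligned range stays inside that window.

static bool wp_range_within_old_folio(struct vm_fault *vmf,
                                      struct folio *old_folio,
                                      unsigned long haddr, int order)
{
        /*
         * Virtual address where page 0 of the old folio sits, assuming
         * the folio is mapped contiguously in this VMA; mremap() may
         * have moved it away from its natural alignment.
         */
        unsigned long idx = page_to_pfn(vmf->page) - folio_pfn(old_folio);
        unsigned long folio_start = vmf->address - idx * PAGE_SIZE;

        return haddr >= folio_start &&
               haddr + (PAGE_SIZE << order) <=
                        folio_start + folio_size(old_folio);
}

When the range falls outside that window, the fault would simply take
the existing single-page path.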

>
> Which stresses my point: khugepaged might be the better place to re-collapse
> where reasonable, avoiding further complexity in our CoW handling.
>
> --
> Cheers
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 6/7] mm: memory: add mTHP support for wp
  2025-08-15 15:20     ` Vernon Yang
@ 2025-08-16  6:40       ` David Hildenbrand
  0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2025-08-16  6:40 UTC (permalink / raw)
  To: Vernon Yang
  Cc: akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, glider, elver, dvyukov, vbabka,
	rppt, surenb, mhocko, muchun.song, osalvador, shuah,
	richardcochran, linux-mm, linux-kernel

On 15.08.25 17:20, Vernon Yang wrote:
> On Thu, Aug 14, 2025 at 01:58:34PM +0200, David Hildenbrand wrote:
>> On 14.08.25 13:38, Vernon Yang wrote:
>>> Currently pagefaults on anonymous pages support mthp, and hardware
>>> features (such as arm64 contpte) can be used to store multiple ptes in
>>> one TLB entry, reducing the probability of TLB misses. However, when the
>>> process is forked and the cow is triggered again, the above optimization
>>> effect is lost, and only 4KB is requested once at a time.
>>>
>>> Therefore, make pagefault write-protect copy support mthp to maintain the
>>> optimization effect of TLB and improve the efficiency of cow pagefault.
>>>
>>> vm-scalability usemem shows a great improvement,
>>> test using: usemem -n 32 --prealloc --prefault 249062617
>>> (result unit is KB/s, bigger is better)
>>>
>>> |    size     | w/o patch | w/ patch  |  delta  |
>>> |-------------|-----------|-----------|---------|
>>> | baseline 4K | 723041.63 | 717643.21 | -0.75%  |
>>> | mthp 16K    | 732871.14 | 799513.18 | +9.09%  |
>>> | mthp 32K    | 746060.91 | 836261.83 | +12.09% |
>>> | mthp 64K    | 747333.18 | 855570.43 | +14.48% |
>>
>> You're missing two of the most important metrics: COW latency and memory
>> waste.
> 
> OK, I will add those two tests later.
> 
>>
>> Just imagine what happens if you have PMD-sized THP.
>>
>> I would suggest you explore why Redis used to recommend to disable THPs
>> (hint: tail latency due to COW of way-too-large chunks before we do what we
>> do today).
> 
> Thanks for the suggestion, I'm not very familiar with Redis indeed. Currently,
> this series supports small granularity sizes, such as 16KB, and I will also
> test redis-benchmark later to see the severity of tail latency.
> 
>>
>> So staring at usemem micro-benchmark results is a bit misleading.
>>
>> As discussed in the past, I would actually suggest to
>>
>> a) Let khugepaged deal with fixing this up later, keeping CoW path
>>     simpler and faster.
>> b) If we really really have to do this during fault time, limit it to
>>     some order (might even be have to be configurable).
> 
> Adding a knob similar to shmem_enabled later, if needed, would be a
> good way to do that.
> 
>>
>> I really think we should keep CoW latency low and instead let khugepaged fix
>> that up later. (Nico is working on mTHP collapse support)
>>
>> [are you handling having a mixture of PageAnonExclusive within a folio
>> properly? Only staring at R/O PTEs is usually insufficient to determine
>> whether you can COW or whether you must reuse].
> 
> There is no extra processing on PageAnonExclusive here, only judging by R/O PTEs,
> thank you for pointing it out, and I will look into how to properly handle
> this situation later.

Yes, but as I said: I much prefer to let khugepaged handle that. I am 
not convinced the complexity here is warranted.

Nico's patches should soon be in shape to collapse mthp. (see the list)

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/7] add mTHP support for wp
@ 2025-08-19  0:55 zhangqilong
  2025-08-19 18:21 ` Vernon Yang
  0 siblings, 1 reply; 15+ messages in thread
From: zhangqilong @ 2025-08-19  0:55 UTC (permalink / raw)
  To: Vernon Yang, akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, ziy@nvidia.com,
	baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
	npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
	baohua@kernel.org, glider@google.com, elver@google.com,
	dvyukov@google.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, muchun.song@linux.dev,
	osalvador@suse.de, shuah@kernel.org, richardcochran@gmail.com
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org

> 
> Hi all,
> 
> This patchset is introduced to make pagefault write-protect copy support
> mthp, with this series, pagefault write-protect copy will have a 9-14%
> performance improvement.
> 
> Currently pagefaults on anonymous pages support mthp [1], and hardware
> features (such as arm64 contpte) can be used to store multiple ptes in one
> TLB entry, reducing the probability of TLB misses. However, when the
> process is forked and the cow is triggered again, the above optimization
> effect is lost, and only 4KB is requested once at a time.
> 
> Therefore, make pagefault write-protect copy support mthp to maintain the
> optimization effect of TLB and improve the efficiency of cow pagefault.
> 
> vm-scalability usemem shows a great improvement, test using: usemem -n
> 32 --prealloc --prefault 249062617 (result unit is KB/s, bigger is better)
> 
> |    size     | w/o patch | w/ patch  |  delta  |
> |-------------|-----------|-----------|---------|
> | baseline 4K | 723041.63 | 717643.21 | -0.75%  |
> | mthp 16K    | 732871.14 | 799513.18 | +9.09%  |
> | mthp 32K    | 746060.91 | 836261.83 | +12.09% |
> | mthp 64K    | 747333.18 | 855570.43 | +14.48% |
> 
> This series is based on Linux v6.16 (038d61fd6422).
> 
> Thanks,
> Vernon
> 
> [1] https://lore.kernel.org/all/20231207161211.2374093-1-
> ryan.roberts@arm.com/
> 
> Vernon Yang (7):
>   mm: memory: replace single-operation with multi-operation in wp
>   mm: memory: add ptep_clear_flush_range function
>   mm: memory: add kmsan_copy_pages_meta function
>   mm: memory: add offset to start copy for copy_user_gigantic_page
>   mm: memory: improve wp_page_copy readability
>   mm: memory: add mTHP support for wp

Oh, we are also doing similar optimizations, but only for the code segment. :)

>   selftests: mm: support wp mTHP collapse testing
> 
>  include/linux/huge_mm.h                 |   3 +
>  include/linux/kmsan.h                   |  13 +-
>  include/linux/mm.h                      |   8 +
>  include/linux/pgtable.h                 |   3 +
>  mm/hugetlb.c                            |   6 +-
>  mm/kmsan/shadow.c                       |  26 +-
>  mm/memory.c                             | 309 ++++++++++++++++++------
>  mm/pgtable-generic.c                    |  20 ++
>  tools/testing/selftests/mm/khugepaged.c |   5 +-
>  9 files changed, 302 insertions(+), 91 deletions(-)
> 
> --
> 2.50.1
> 



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/7] add mTHP support for wp
  2025-08-19  0:55 [RFC PATCH 0/7] add mTHP support for wp zhangqilong
@ 2025-08-19 18:21 ` Vernon Yang
  0 siblings, 0 replies; 15+ messages in thread
From: Vernon Yang @ 2025-08-19 18:21 UTC (permalink / raw)
  To: zhangqilong
  Cc: akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, ziy@nvidia.com,
	baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
	npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
	baohua@kernel.org, glider@google.com, elver@google.com,
	dvyukov@google.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, muchun.song@linux.dev,
	osalvador@suse.de, shuah@kernel.org, richardcochran@gmail.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2659 bytes --]



> On Aug 19, 2025, at 08:55, zhangqilong <zhangqilong3@huawei.com> wrote:
> 
>> 
>> Hi all,
>> 
>> This patchset is introduced to make pagefault write-protect copy support
>> mthp, with this series, pagefault write-protect copy will have a 9-14%
>> performance improvement.
>> 
>> Currently pagefaults on anonymous pages support mthp [1], and hardware
>> features (such as arm64 contpte) can be used to store multiple ptes in one
>> TLB entry, reducing the probability of TLB misses. However, when the
>> process is forked and the cow is triggered again, the above optimization
>> effect is lost, and only 4KB is requested once at a time.
>> 
>> Therefore, make pagefault write-protect copy support mthp to maintain the
>> optimization effect of TLB and improve the efficiency of cow pagefault.
>> 
>> vm-scalability usemem shows a great improvement, test using: usemem -n
>> 32 --prealloc --prefault 249062617 (result unit is KB/s, bigger is better)
>> 
>> |    size     | w/o patch | w/ patch  |  delta  |
>> |-------------|-----------|-----------|---------|
>> | baseline 4K | 723041.63 | 717643.21 | -0.75%  |
>> | mthp 16K    | 732871.14 | 799513.18 | +9.09%  |
>> | mthp 32K    | 746060.91 | 836261.83 | +12.09% |
>> | mthp 64K    | 747333.18 | 855570.43 | +14.48% |
>> 
>> This series is based on Linux v6.16 (038d61fd6422).
>> 
>> Thanks,
>> Vernon
>> 
>> [1] https://lore.kernel.org/all/20231207161211.2374093-1-
>> ryan.roberts@arm.com/
>> 
>> Vernon Yang (7):
>> mm: memory: replace single-operation with multi-operation in wp
>> mm: memory: add ptep_clear_flush_range function
>> mm: memory: add kmsan_copy_pages_meta function
>> mm: memory: add offset to start copy for copy_user_gigantic_page
>> mm: memory: improve wp_page_copy readability
>> mm: memory: add mTHP support for wp
> 
> Oh, we are also doing similar optimizations, but only for the code segment. :)

Good! You are doing a similar optimization for the code segment; which
mTHP size do you use?

> 
>> selftests: mm: support wp mTHP collapse testing
>> 
>> include/linux/huge_mm.h                 |   3 +
>> include/linux/kmsan.h                   |  13 +-
>> include/linux/mm.h                      |   8 +
>> include/linux/pgtable.h                 |   3 +
>> mm/hugetlb.c                            |   6 +-
>> mm/kmsan/shadow.c                       |  26 +-
>> mm/memory.c                             | 309 ++++++++++++++++++------
>> mm/pgtable-generic.c                    |  20 ++
>> tools/testing/selftests/mm/khugepaged.c |   5 +-
>> 9 files changed, 302 insertions(+), 91 deletions(-)
>> 
>> --
>> 2.50.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-08-19 18:22 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-14 11:38 [RFC PATCH 0/7] add mTHP support for wp Vernon Yang
2025-08-14 11:38 ` [RFC PATCH 1/7] mm: memory: replace single-operation with multi-operation in wp Vernon Yang
2025-08-14 11:38 ` [RFC PATCH 2/7] mm: memory: add ptep_clear_flush_range function Vernon Yang
2025-08-14 11:38 ` [RFC PATCH 3/7] mm: memory: add kmsan_copy_pages_meta function Vernon Yang
2025-08-14 11:38 ` [RFC PATCH 4/7] mm: memory: add offset to start copy for copy_user_gigantic_page Vernon Yang
2025-08-14 11:38 ` [RFC PATCH 5/7] mm: memory: improve wp_page_copy readability Vernon Yang
2025-08-14 11:38 ` [RFC PATCH 6/7] mm: memory: add mTHP support for wp Vernon Yang
2025-08-14 11:58   ` David Hildenbrand
2025-08-15 15:20     ` Vernon Yang
2025-08-16  6:40       ` David Hildenbrand
2025-08-14 12:57   ` David Hildenbrand
2025-08-15 15:30     ` Vernon Yang
2025-08-14 11:38 ` [RFC PATCH 7/7] selftests: mm: support wp mTHP collapse testing Vernon Yang
  -- strict thread matches above, loose matches on Subject: below --
2025-08-19  0:55 [RFC PATCH 0/7] add mTHP support for wp zhangqilong
2025-08-19 18:21 ` Vernon Yang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).