linux-kernel.vger.kernel.org archive mirror
* [PATCH v1 0/4] mm: folio_pte_batch() improvements
@ 2025-06-27 11:55 David Hildenbrand
  2025-06-27 11:55 ` [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_* David Hildenbrand
                   ` (3 more replies)
  0 siblings, 4 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-06-27 11:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

Ever since we added folio_pte_batch() for fork() + munmap() purposes,
a lot more users have appeared (and more are being proposed), and more
functionality has been added.

Most of the users only need basic functionality, and could benefit from
a non-inlined version.

So let's clean up folio_pte_batch() and split it into a basic
folio_pte_batch() (no flags) and a more advanced folio_pte_batch_ext().
Using either variant will now look much cleaner.
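
For example, a basic call site that previously had to spell out all the
unused parameters

  nr = folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags,
                       NULL, NULL, NULL);

now becomes

  nr = folio_pte_batch(folio, ptep, pte, max_nr);

while the few flag-aware users (e.g., fork() and madvise()) switch to
folio_pte_batch_ext().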

This series will likely conflict with some changes in some
(old+new) folio_pte_batch() users, but conflicts should be trivial to
resolve.

Tested on x86-64. Cross-compile tested.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Harry Yoo <harry.yoo@oracle.com>

David Hildenbrand (4):
  mm: convert FPB_IGNORE_* into FPB_HONOR_*
  mm: smaller folio_pte_batch() improvements
  mm: split folio_pte_batch() into folio_pte_batch() and
    folio_pte_batch_ext()
  mm: remove boolean output parameters from folio_pte_batch_ext()

 mm/internal.h  | 110 +++++++++++++++++++++++++++----------------------
 mm/madvise.c   |  27 +++---------
 mm/memory.c    |  21 ++++------
 mm/mempolicy.c |   5 +--
 mm/mlock.c     |   4 +-
 mm/mremap.c    |   4 +-
 mm/rmap.c      |   4 +-
 mm/util.c      |  29 +++++++++++++
 8 files changed, 105 insertions(+), 99 deletions(-)


base-commit: 0051fec1d393b659ffee707f869f8ffe4d1632e2
-- 
2.49.0


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-27 11:55 [PATCH v1 0/4] mm: folio_pte_batch() improvements David Hildenbrand
@ 2025-06-27 11:55 ` David Hildenbrand
  2025-06-27 13:40   ` Lance Yang
                     ` (4 more replies)
  2025-06-27 11:55 ` [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements David Hildenbrand
                   ` (2 subsequent siblings)
  3 siblings, 5 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-06-27 11:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

Honoring these PTE bits is the exception, so let's invert the meaning.

With this change, most callers don't have to pass any flags.

No functional change intended.
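
For example, in copy_present_ptes() the conditions are simply inverted:

  if (!(src_vma->vm_flags & VM_SHARED))
          flags |= FPB_HONOR_DIRTY;
  if (vma_soft_dirty_enabled(src_vma))
          flags |= FPB_HONOR_SOFT_DIRTY;

and callers that previously passed
FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY now simply pass 0.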

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h  | 16 ++++++++--------
 mm/madvise.c   |  3 +--
 mm/memory.c    | 11 +++++------
 mm/mempolicy.c |  4 +---
 mm/mlock.c     |  3 +--
 mm/mremap.c    |  3 +--
 mm/rmap.c      |  3 +--
 7 files changed, 18 insertions(+), 25 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index e84217e27778d..9690c75063881 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -202,17 +202,17 @@ static inline void vma_close(struct vm_area_struct *vma)
 /* Flags for folio_pte_batch(). */
 typedef int __bitwise fpb_t;
 
-/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
-#define FPB_IGNORE_DIRTY		((__force fpb_t)BIT(0))
+/* Compare PTEs honoring the dirty bit. */
+#define FPB_HONOR_DIRTY		((__force fpb_t)BIT(0))
 
-/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
-#define FPB_IGNORE_SOFT_DIRTY		((__force fpb_t)BIT(1))
+/* Compare PTEs honoring the soft-dirty bit. */
+#define FPB_HONOR_SOFT_DIRTY		((__force fpb_t)BIT(1))
 
 static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
 {
-	if (flags & FPB_IGNORE_DIRTY)
+	if (!(flags & FPB_HONOR_DIRTY))
 		pte = pte_mkclean(pte);
-	if (likely(flags & FPB_IGNORE_SOFT_DIRTY))
+	if (likely(!(flags & FPB_HONOR_SOFT_DIRTY)))
 		pte = pte_clear_soft_dirty(pte);
 	return pte_wrprotect(pte_mkold(pte));
 }
@@ -236,8 +236,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  * pages of the same large folio.
  *
  * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
- * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
- * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
+ * the accessed bit, writable bit, dirty bit (unless FPB_HONOR_DIRTY is set) and
+ * soft-dirty bit (unless FPB_HONOR_SOFT_DIRTY is set).
  *
  * start_ptep must map any page of the folio. max_nr must be at least one and
  * must be limited by the caller so scanning cannot exceed a single page table.
diff --git a/mm/madvise.c b/mm/madvise.c
index e61e32b2cd91f..661bb743d2216 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -347,10 +347,9 @@ static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
 					  pte_t pte, bool *any_young,
 					  bool *any_dirty)
 {
-	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	int max_nr = (end - addr) / PAGE_SIZE;
 
-	return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
+	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
 			       any_young, any_dirty);
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 0f9b32a20e5b7..ab2d6c1425691 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -990,10 +990,10 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	 * by keeping the batching logic separate.
 	 */
 	if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
-		if (src_vma->vm_flags & VM_SHARED)
-			flags |= FPB_IGNORE_DIRTY;
-		if (!vma_soft_dirty_enabled(src_vma))
-			flags |= FPB_IGNORE_SOFT_DIRTY;
+		if (!(src_vma->vm_flags & VM_SHARED))
+			flags |= FPB_HONOR_DIRTY;
+		if (vma_soft_dirty_enabled(src_vma))
+			flags |= FPB_HONOR_SOFT_DIRTY;
 
 		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
 				     &any_writable, NULL, NULL);
@@ -1535,7 +1535,6 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 		struct zap_details *details, int *rss, bool *force_flush,
 		bool *force_break, bool *any_skipped)
 {
-	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	struct mm_struct *mm = tlb->mm;
 	struct folio *folio;
 	struct page *page;
@@ -1565,7 +1564,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 	 * by keeping the batching logic separate.
 	 */
 	if (unlikely(folio_test_large(folio) && max_nr != 1)) {
-		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
+		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, 0,
 				     NULL, NULL, NULL);
 
 		zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1ff7b2174eb77..2a25eedc3b1c0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -675,7 +675,6 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
 static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 			unsigned long end, struct mm_walk *walk)
 {
-	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	struct vm_area_struct *vma = walk->vma;
 	struct folio *folio;
 	struct queue_pages *qp = walk->private;
@@ -713,8 +712,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		if (folio_test_large(folio) && max_nr != 1)
 			nr = folio_pte_batch(folio, addr, pte, ptent,
-					     max_nr, fpb_flags,
-					     NULL, NULL, NULL);
+					     max_nr, 0, NULL, NULL, NULL);
 		/*
 		 * vm_normal_folio() filters out zero pages, but there might
 		 * still be reserved folios to skip, perhaps in a VDSO.
diff --git a/mm/mlock.c b/mm/mlock.c
index 3cb72b579ffd3..2238cdc5eb1c1 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -307,14 +307,13 @@ void munlock_folio(struct folio *folio)
 static inline unsigned int folio_mlock_step(struct folio *folio,
 		pte_t *pte, unsigned long addr, unsigned long end)
 {
-	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	unsigned int count = (end - addr) >> PAGE_SHIFT;
 	pte_t ptent = ptep_get(pte);
 
 	if (!folio_test_large(folio))
 		return 1;
 
-	return folio_pte_batch(folio, addr, pte, ptent, count, fpb_flags, NULL,
+	return folio_pte_batch(folio, addr, pte, ptent, count, 0, NULL,
 			       NULL, NULL);
 }
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 36585041c760d..d4d3ffc931502 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -173,7 +173,6 @@ static pte_t move_soft_dirty_pte(pte_t pte)
 static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
 		pte_t *ptep, pte_t pte, int max_nr)
 {
-	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	struct folio *folio;
 
 	if (max_nr == 1)
@@ -183,7 +182,7 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
 	if (!folio || !folio_test_large(folio))
 		return 1;
 
-	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
+	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
 			       NULL, NULL);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 3b74bb19c11dd..a29d7d29c7283 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1849,7 +1849,6 @@ void folio_remove_rmap_pud(struct folio *folio, struct page *page,
 static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
 			struct folio *folio, pte_t *ptep)
 {
-	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	int max_nr = folio_nr_pages(folio);
 	pte_t pte = ptep_get(ptep);
 
@@ -1860,7 +1859,7 @@ static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
 	if (pte_pfn(pte) != folio_pfn(folio))
 		return false;
 
-	return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
+	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
 			       NULL, NULL) == max_nr;
 }
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements
  2025-06-27 11:55 [PATCH v1 0/4] mm: folio_pte_batch() improvements David Hildenbrand
  2025-06-27 11:55 ` [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_* David Hildenbrand
@ 2025-06-27 11:55 ` David Hildenbrand
  2025-06-27 13:58   ` Lance Yang
                     ` (3 more replies)
  2025-06-27 11:55 ` [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext() David Hildenbrand
  2025-06-27 11:55 ` [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext() David Hildenbrand
  3 siblings, 4 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-06-27 11:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

Let's clean up a bit:

(1) No need for start_ptep vs. ptep anymore, we can simply use ptep

(2) Let's switch to "unsigned int" for everything

(3) We can simplify the code by leaving the pte unchanged after the
    pte_same() check.

(4) Clarify that we should never exceed a single VMA; it indicates a
    problem in the caller.

No functional change intended.
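
To illustrate (3), the loop now compares against a cleared copy and only
reads the relevant bits from the unmodified pte after the comparison
succeeded:

  pte = ptep_get(ptep);

  if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte))
          break;

  if (any_writable)
          *any_writable |= pte_write(pte);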

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h | 37 +++++++++++++++----------------------
 1 file changed, 15 insertions(+), 22 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 9690c75063881..ca6590c6d9eab 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -221,7 +221,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  * folio_pte_batch - detect a PTE batch for a large folio
  * @folio: The large folio to detect a PTE batch for.
  * @addr: The user virtual address the first page is mapped at.
- * @start_ptep: Page table pointer for the first entry.
+ * @ptep: Page table pointer for the first entry.
  * @pte: Page table entry for the first page.
  * @max_nr: The maximum number of table entries to consider.
  * @flags: Flags to modify the PTE batch semantics.
@@ -233,24 +233,24 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  *		  first one is dirty.
  *
  * Detect a PTE batch: consecutive (present) PTEs that map consecutive
- * pages of the same large folio.
+ * pages of the same large folio in a single VMA and a single page table.
  *
  * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
  * the accessed bit, writable bit, dirty bit (unless FPB_HONOR_DIRTY is set) and
  * soft-dirty bit (unless FPB_HONOR_SOFT_DIRTY is set).
  *
- * start_ptep must map any page of the folio. max_nr must be at least one and
- * must be limited by the caller so scanning cannot exceed a single page table.
+ * @ptep must map any page of the folio. max_nr must be at least one and
+ * must be limited by the caller so scanning cannot exceed a single VMA and
+ * a single page table.
  *
  * Return: the number of table entries in the batch.
  */
-static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
-		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
+static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
+		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
 		bool *any_writable, bool *any_young, bool *any_dirty)
 {
-	pte_t expected_pte, *ptep;
-	bool writable, young, dirty;
-	int nr, cur_nr;
+	unsigned int nr, cur_nr;
+	pte_t expected_pte;
 
 	if (any_writable)
 		*any_writable = false;
@@ -267,29 +267,22 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 	max_nr = min_t(unsigned long, max_nr,
 		       folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
 
-	nr = pte_batch_hint(start_ptep, pte);
+	nr = pte_batch_hint(ptep, pte);
 	expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
-	ptep = start_ptep + nr;
+	ptep = ptep + nr;
 
 	while (nr < max_nr) {
 		pte = ptep_get(ptep);
-		if (any_writable)
-			writable = !!pte_write(pte);
-		if (any_young)
-			young = !!pte_young(pte);
-		if (any_dirty)
-			dirty = !!pte_dirty(pte);
-		pte = __pte_batch_clear_ignored(pte, flags);
 
-		if (!pte_same(pte, expected_pte))
+		if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte))
 			break;
 
 		if (any_writable)
-			*any_writable |= writable;
+			*any_writable |= pte_write(pte);
 		if (any_young)
-			*any_young |= young;
+			*any_young |= pte_young(pte);
 		if (any_dirty)
-			*any_dirty |= dirty;
+			*any_dirty |= pte_dirty(pte);
 
 		cur_nr = pte_batch_hint(ptep, pte);
 		expected_pte = pte_advance_pfn(expected_pte, cur_nr);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
  2025-06-27 11:55 [PATCH v1 0/4] mm: folio_pte_batch() improvements David Hildenbrand
  2025-06-27 11:55 ` [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_* David Hildenbrand
  2025-06-27 11:55 ` [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements David Hildenbrand
@ 2025-06-27 11:55 ` David Hildenbrand
  2025-06-27 14:19   ` Lance Yang
                     ` (4 more replies)
  2025-06-27 11:55 ` [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext() David Hildenbrand
  3 siblings, 5 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-06-27 11:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

Many users (including upcoming ones) don't really need the flags etc.,
and can live with a function call.

So let's provide a basic, non-inlined folio_pte_batch().

In zap_present_ptes(), where we care about performance, the compiler
already seems to generate a call to a common inlined folio_pte_batch()
variant, shared with fork() code. So calling the new non-inlined variant
should not make a difference.

While at it, drop the "addr" parameter that is unused.
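
With the split, simple users end up with a plain call

  nr = folio_pte_batch(folio, pte, ptent, max_nr);

while users that need flags or the any_* outputs, such as fork(), keep
the inlined variant:

  nr = folio_pte_batch_ext(folio, src_pte, pte, max_nr, flags,
                           &any_writable, NULL, NULL);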

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h  | 11 ++++++++---
 mm/madvise.c   |  4 ++--
 mm/memory.c    |  6 ++----
 mm/mempolicy.c |  3 +--
 mm/mlock.c     |  3 +--
 mm/mremap.c    |  3 +--
 mm/rmap.c      |  3 +--
 mm/util.c      | 29 +++++++++++++++++++++++++++++
 8 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index ca6590c6d9eab..6000b683f68ee 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -218,9 +218,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
 }
 
 /**
- * folio_pte_batch - detect a PTE batch for a large folio
+ * folio_pte_batch_ext - detect a PTE batch for a large folio
  * @folio: The large folio to detect a PTE batch for.
- * @addr: The user virtual address the first page is mapped at.
  * @ptep: Page table pointer for the first entry.
  * @pte: Page table entry for the first page.
  * @max_nr: The maximum number of table entries to consider.
@@ -243,9 +242,12 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  * must be limited by the caller so scanning cannot exceed a single VMA and
  * a single page table.
  *
+ * This function will be inlined to optimize based on the input parameters;
+ * consider using folio_pte_batch() instead if applicable.
+ *
  * Return: the number of table entries in the batch.
  */
-static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
+static inline unsigned int folio_pte_batch_ext(struct folio *folio,
 		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
 		bool *any_writable, bool *any_young, bool *any_dirty)
 {
@@ -293,6 +295,9 @@ static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long ad
 	return min(nr, max_nr);
 }
 
+unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
+		unsigned int max_nr);
+
 /**
  * pte_move_swp_offset - Move the swap entry offset field of a swap pte
  *	 forward or backward by delta
diff --git a/mm/madvise.c b/mm/madvise.c
index 661bb743d2216..9b9c35a398ed0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -349,8 +349,8 @@ static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
 {
 	int max_nr = (end - addr) / PAGE_SIZE;
 
-	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
-			       any_young, any_dirty);
+	return folio_pte_batch_ext(folio, ptep, pte, max_nr, 0, NULL,
+				   any_young, any_dirty);
 }
 
 static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
diff --git a/mm/memory.c b/mm/memory.c
index ab2d6c1425691..43d35d6675f2e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -995,7 +995,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		if (vma_soft_dirty_enabled(src_vma))
 			flags |= FPB_HONOR_SOFT_DIRTY;
 
-		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
+		nr = folio_pte_batch_ext(folio, src_pte, pte, max_nr, flags,
 				     &any_writable, NULL, NULL);
 		folio_ref_add(folio, nr);
 		if (folio_test_anon(folio)) {
@@ -1564,9 +1564,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 	 * by keeping the batching logic separate.
 	 */
 	if (unlikely(folio_test_large(folio) && max_nr != 1)) {
-		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, 0,
-				     NULL, NULL, NULL);
-
+		nr = folio_pte_batch(folio, pte, ptent, max_nr);
 		zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
 				       addr, details, rss, force_flush,
 				       force_break, any_skipped);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2a25eedc3b1c0..eb83cff7db8c3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -711,8 +711,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 		if (!folio || folio_is_zone_device(folio))
 			continue;
 		if (folio_test_large(folio) && max_nr != 1)
-			nr = folio_pte_batch(folio, addr, pte, ptent,
-					     max_nr, 0, NULL, NULL, NULL);
+			nr = folio_pte_batch(folio, pte, ptent, max_nr);
 		/*
 		 * vm_normal_folio() filters out zero pages, but there might
 		 * still be reserved folios to skip, perhaps in a VDSO.
diff --git a/mm/mlock.c b/mm/mlock.c
index 2238cdc5eb1c1..a1d93ad33c6db 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -313,8 +313,7 @@ static inline unsigned int folio_mlock_step(struct folio *folio,
 	if (!folio_test_large(folio))
 		return 1;
 
-	return folio_pte_batch(folio, addr, pte, ptent, count, 0, NULL,
-			       NULL, NULL);
+	return folio_pte_batch(folio, pte, ptent, count);
 }
 
 static inline bool allow_mlock_munlock(struct folio *folio,
diff --git a/mm/mremap.c b/mm/mremap.c
index d4d3ffc931502..1f5bebbb9c0cb 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -182,8 +182,7 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
 	if (!folio || !folio_test_large(folio))
 		return 1;
 
-	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
-			       NULL, NULL);
+	return folio_pte_batch(folio, ptep, pte, max_nr);
 }
 
 static int move_ptes(struct pagetable_move_control *pmc,
diff --git a/mm/rmap.c b/mm/rmap.c
index a29d7d29c7283..6658968600b72 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1859,8 +1859,7 @@ static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
 	if (pte_pfn(pte) != folio_pfn(folio))
 		return false;
 
-	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
-			       NULL, NULL) == max_nr;
+	return folio_pte_batch(folio, ptep, pte, max_nr) == max_nr;
 }
 
 /*
diff --git a/mm/util.c b/mm/util.c
index 0b270c43d7d12..d29dcc135ad28 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1171,3 +1171,32 @@ int compat_vma_mmap_prepare(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 EXPORT_SYMBOL(compat_vma_mmap_prepare);
+
+#ifdef CONFIG_MMU
+/**
+ * folio_pte_batch - detect a PTE batch for a large folio
+ * @folio: The large folio to detect a PTE batch for.
+ * @ptep: Page table pointer for the first entry.
+ * @pte: Page table entry for the first page.
+ * @max_nr: The maximum number of table entries to consider.
+ *
+ * This is a simplified variant of folio_pte_batch_ext().
+ *
+ * Detect a PTE batch: consecutive (present) PTEs that map consecutive
+ * pages of the same large folio in a single VMA and a single page table.
+ *
+ * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
+ * the accessed bit, writable bit, dirty bit and soft-dirty bit.
+ *
+ * ptep must map any page of the folio. max_nr must be at least one and
+ * must be limited by the caller so scanning cannot exceed a single VMA and
+ * a single page table.
+ *
+ * Return: the number of table entries in the batch.
+ */
+unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
+		unsigned int max_nr)
+{
+	return folio_pte_batch_ext(folio, ptep, pte, max_nr, 0, NULL, NULL, NULL);
+}
+#endif /* CONFIG_MMU */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-06-27 11:55 [PATCH v1 0/4] mm: folio_pte_batch() improvements David Hildenbrand
                   ` (2 preceding siblings ...)
  2025-06-27 11:55 ` [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext() David Hildenbrand
@ 2025-06-27 11:55 ` David Hildenbrand
  2025-06-27 14:34   ` Lance Yang
                     ` (2 more replies)
  3 siblings, 3 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-06-27 11:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

Instead, let's just allow for specifying through flags whether we want
to have bits merged into the original PTE.

For the madvise() case, simplify by having only a single parameter for
merging young+dirty. For madvise_cold_or_pageout_pte_range() merging the
dirty bit is not required, but also not harmful. This code is not
performance-critical enough to really force such micro-optimizations.
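
For example, the madvise() code now only does (combining the helper and
its callers):

  nr = folio_pte_batch_ext(folio, NULL, ptep, &ptent, max_nr,
                           FPB_MERGE_YOUNG_DIRTY);
  /* ptent is now young (and dirty) if any PTE in the batch was. */

instead of passing separate any_young/any_dirty pointers and applying
pte_mkyoung()/pte_mkdirty() manually.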

As we now have two pte_t * parameters, use PageTable() to make sure we
are actually given a pointer to a copy of the PTE, not a pointer into
an actual page table.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h | 58 +++++++++++++++++++++++++++++++--------------------
 mm/madvise.c  | 26 +++++------------------
 mm/memory.c   |  8 ++-----
 mm/util.c     |  2 +-
 4 files changed, 43 insertions(+), 51 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 6000b683f68ee..fe69e21b34a24 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -208,6 +208,18 @@ typedef int __bitwise fpb_t;
 /* Compare PTEs honoring the soft-dirty bit. */
 #define FPB_HONOR_SOFT_DIRTY		((__force fpb_t)BIT(1))
 
+/*
+ * Merge PTE write bits: if any PTE in the batch is writable, modify the
+ * PTE at @ptentp to be writable.
+ */
+#define FPB_MERGE_WRITE			((__force fpb_t)BIT(2))
+
+/*
+ * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty,
+ * modify the PTE at @ptentp to be young or dirty, respectively.
+ */
+#define FPB_MERGE_YOUNG_DIRTY		((__force fpb_t)BIT(3))
+
 static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
 {
 	if (!(flags & FPB_HONOR_DIRTY))
@@ -220,16 +232,11 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
 /**
  * folio_pte_batch_ext - detect a PTE batch for a large folio
  * @folio: The large folio to detect a PTE batch for.
+ * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
  * @ptep: Page table pointer for the first entry.
- * @pte: Page table entry for the first page.
+ * @ptentp: Pointer at a copy of the first page table entry.
  * @max_nr: The maximum number of table entries to consider.
  * @flags: Flags to modify the PTE batch semantics.
- * @any_writable: Optional pointer to indicate whether any entry except the
- *		  first one is writable.
- * @any_young: Optional pointer to indicate whether any entry except the
- *		  first one is young.
- * @any_dirty: Optional pointer to indicate whether any entry except the
- *		  first one is dirty.
  *
  * Detect a PTE batch: consecutive (present) PTEs that map consecutive
  * pages of the same large folio in a single VMA and a single page table.
@@ -242,28 +249,26 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  * must be limited by the caller so scanning cannot exceed a single VMA and
  * a single page table.
  *
+ * Depending on the FPB_MERGE_* flags, the pte stored at @ptentp will
+ * be modified.
+ *
  * This function will be inlined to optimize based on the input parameters;
  * consider using folio_pte_batch() instead if applicable.
  *
  * Return: the number of table entries in the batch.
  */
 static inline unsigned int folio_pte_batch_ext(struct folio *folio,
-		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
-		bool *any_writable, bool *any_young, bool *any_dirty)
+		struct vm_area_struct *vma, pte_t *ptep, pte_t *ptentp,
+		unsigned int max_nr, fpb_t flags)
 {
+	bool any_writable = false, any_young = false, any_dirty = false;
+	pte_t expected_pte, pte = *ptentp;
 	unsigned int nr, cur_nr;
-	pte_t expected_pte;
-
-	if (any_writable)
-		*any_writable = false;
-	if (any_young)
-		*any_young = false;
-	if (any_dirty)
-		*any_dirty = false;
 
 	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
 	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
 	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
+	VM_WARN_ON(virt_addr_valid(ptentp) && PageTable(virt_to_page(ptentp)));
 
 	/* Limit max_nr to the actual remaining PFNs in the folio we could batch. */
 	max_nr = min_t(unsigned long, max_nr,
@@ -279,12 +284,12 @@ static inline unsigned int folio_pte_batch_ext(struct folio *folio,
 		if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte))
 			break;
 
-		if (any_writable)
-			*any_writable |= pte_write(pte);
-		if (any_young)
-			*any_young |= pte_young(pte);
-		if (any_dirty)
-			*any_dirty |= pte_dirty(pte);
+		if (flags & FPB_MERGE_WRITE)
+			any_writable |= pte_write(pte);
+		if (flags & FPB_MERGE_YOUNG_DIRTY) {
+			any_young |= pte_young(pte);
+			any_dirty |= pte_dirty(pte);
+		}
 
 		cur_nr = pte_batch_hint(ptep, pte);
 		expected_pte = pte_advance_pfn(expected_pte, cur_nr);
@@ -292,6 +297,13 @@ static inline unsigned int folio_pte_batch_ext(struct folio *folio,
 		nr += cur_nr;
 	}
 
+	if (any_writable)
+		*ptentp = pte_mkwrite(*ptentp, vma);
+	if (any_young)
+		*ptentp = pte_mkyoung(*ptentp);
+	if (any_dirty)
+		*ptentp = pte_mkdirty(*ptentp);
+
 	return min(nr, max_nr);
 }
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 9b9c35a398ed0..dce8f5e8555cb 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -344,13 +344,12 @@ static inline bool can_do_file_pageout(struct vm_area_struct *vma)
 
 static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
 					  struct folio *folio, pte_t *ptep,
-					  pte_t pte, bool *any_young,
-					  bool *any_dirty)
+					  pte_t *ptentp)
 {
 	int max_nr = (end - addr) / PAGE_SIZE;
 
-	return folio_pte_batch_ext(folio, ptep, pte, max_nr, 0, NULL,
-				   any_young, any_dirty);
+	return folio_pte_batch_ext(folio, NULL, ptep, ptentp, max_nr,
+				   FPB_MERGE_YOUNG_DIRTY);
 }
 
 static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
@@ -488,13 +487,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		 * next pte in the range.
 		 */
 		if (folio_test_large(folio)) {
-			bool any_young;
-
-			nr = madvise_folio_pte_batch(addr, end, folio, pte,
-						     ptent, &any_young, NULL);
-			if (any_young)
-				ptent = pte_mkyoung(ptent);
-
+			nr = madvise_folio_pte_batch(addr, end, folio, pte, &ptent);
 			if (nr < folio_nr_pages(folio)) {
 				int err;
 
@@ -724,11 +717,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		 * next pte in the range.
 		 */
 		if (folio_test_large(folio)) {
-			bool any_young, any_dirty;
-
-			nr = madvise_folio_pte_batch(addr, end, folio, pte,
-						     ptent, &any_young, &any_dirty);
-
+			nr = madvise_folio_pte_batch(addr, end, folio, pte, &ptent);
 			if (nr < folio_nr_pages(folio)) {
 				int err;
 
@@ -753,11 +742,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 					nr = 0;
 				continue;
 			}
-
-			if (any_young)
-				ptent = pte_mkyoung(ptent);
-			if (any_dirty)
-				ptent = pte_mkdirty(ptent);
 		}
 
 		if (folio_test_swapcache(folio) || folio_test_dirty(folio)) {
diff --git a/mm/memory.c b/mm/memory.c
index 43d35d6675f2e..985d09bee44fd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -972,10 +972,9 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		 pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
 		 int max_nr, int *rss, struct folio **prealloc)
 {
+	fpb_t flags = FPB_MERGE_WRITE;
 	struct page *page;
 	struct folio *folio;
-	bool any_writable;
-	fpb_t flags = 0;
 	int err, nr;
 
 	page = vm_normal_page(src_vma, addr, pte);
@@ -995,8 +994,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		if (vma_soft_dirty_enabled(src_vma))
 			flags |= FPB_HONOR_SOFT_DIRTY;
 
-		nr = folio_pte_batch_ext(folio, src_pte, pte, max_nr, flags,
-				     &any_writable, NULL, NULL);
+		nr = folio_pte_batch_ext(folio, src_vma, src_pte, &pte, max_nr, flags);
 		folio_ref_add(folio, nr);
 		if (folio_test_anon(folio)) {
 			if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
@@ -1010,8 +1008,6 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 			folio_dup_file_rmap_ptes(folio, page, nr, dst_vma);
 			rss[mm_counter_file(folio)] += nr;
 		}
-		if (any_writable)
-			pte = pte_mkwrite(pte, src_vma);
 		__copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte,
 				    addr, nr);
 		return nr;
diff --git a/mm/util.c b/mm/util.c
index d29dcc135ad28..19d1a5814fac7 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1197,6 +1197,6 @@ EXPORT_SYMBOL(compat_vma_mmap_prepare);
 unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
 		unsigned int max_nr)
 {
-	return folio_pte_batch_ext(folio, ptep, pte, max_nr, 0, NULL, NULL, NULL);
+	return folio_pte_batch_ext(folio, NULL, ptep, &pte, max_nr, 0);
 }
 #endif /* CONFIG_MMU */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-27 11:55 ` [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_* David Hildenbrand
@ 2025-06-27 13:40   ` Lance Yang
  2025-06-27 16:28   ` Lorenzo Stoakes
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 58+ messages in thread
From: Lance Yang @ 2025-06-27 13:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 7:55 PM David Hildenbrand <david@redhat.com> wrote:
>
> Honoring these PTE bits is the exception, so let's invert the meaning.
>
> With this change, most callers don't have to pass any flags.
>
> No functional change intended.
>

LGTM. Feel free to add:
Reviewed-by: Lance Yang <lance.yang@linux.dev>

Thanks,
Lance

> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/internal.h  | 16 ++++++++--------
>  mm/madvise.c   |  3 +--
>  mm/memory.c    | 11 +++++------
>  mm/mempolicy.c |  4 +---
>  mm/mlock.c     |  3 +--
>  mm/mremap.c    |  3 +--
>  mm/rmap.c      |  3 +--
>  7 files changed, 18 insertions(+), 25 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index e84217e27778d..9690c75063881 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -202,17 +202,17 @@ static inline void vma_close(struct vm_area_struct *vma)
>  /* Flags for folio_pte_batch(). */
>  typedef int __bitwise fpb_t;
>
> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> -#define FPB_IGNORE_DIRTY               ((__force fpb_t)BIT(0))
> +/* Compare PTEs honoring the dirty bit. */
> +#define FPB_HONOR_DIRTY                ((__force fpb_t)BIT(0))
>
> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> -#define FPB_IGNORE_SOFT_DIRTY          ((__force fpb_t)BIT(1))
> +/* Compare PTEs honoring the soft-dirty bit. */
> +#define FPB_HONOR_SOFT_DIRTY           ((__force fpb_t)BIT(1))
>
>  static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  {
> -       if (flags & FPB_IGNORE_DIRTY)
> +       if (!(flags & FPB_HONOR_DIRTY))
>                 pte = pte_mkclean(pte);
> -       if (likely(flags & FPB_IGNORE_SOFT_DIRTY))
> +       if (likely(!(flags & FPB_HONOR_SOFT_DIRTY)))
>                 pte = pte_clear_soft_dirty(pte);
>         return pte_wrprotect(pte_mkold(pte));
>  }
> @@ -236,8 +236,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   * pages of the same large folio.
>   *
>   * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
> - * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
> - * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
> + * the accessed bit, writable bit, dirty bit (unless FPB_HONOR_DIRTY is set) and
> + * soft-dirty bit (unless FPB_HONOR_SOFT_DIRTY is set).
>   *
>   * start_ptep must map any page of the folio. max_nr must be at least one and
>   * must be limited by the caller so scanning cannot exceed a single page table.
> diff --git a/mm/madvise.c b/mm/madvise.c
> index e61e32b2cd91f..661bb743d2216 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -347,10 +347,9 @@ static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
>                                           pte_t pte, bool *any_young,
>                                           bool *any_dirty)
>  {
> -       const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>         int max_nr = (end - addr) / PAGE_SIZE;
>
> -       return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
> +       return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
>                                any_young, any_dirty);
>  }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f9b32a20e5b7..ab2d6c1425691 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -990,10 +990,10 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>          * by keeping the batching logic separate.
>          */
>         if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
> -               if (src_vma->vm_flags & VM_SHARED)
> -                       flags |= FPB_IGNORE_DIRTY;
> -               if (!vma_soft_dirty_enabled(src_vma))
> -                       flags |= FPB_IGNORE_SOFT_DIRTY;
> +               if (!(src_vma->vm_flags & VM_SHARED))
> +                       flags |= FPB_HONOR_DIRTY;
> +               if (vma_soft_dirty_enabled(src_vma))
> +                       flags |= FPB_HONOR_SOFT_DIRTY;
>
>                 nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
>                                      &any_writable, NULL, NULL);
> @@ -1535,7 +1535,6 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
>                 struct zap_details *details, int *rss, bool *force_flush,
>                 bool *force_break, bool *any_skipped)
>  {
> -       const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>         struct mm_struct *mm = tlb->mm;
>         struct folio *folio;
>         struct page *page;
> @@ -1565,7 +1564,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
>          * by keeping the batching logic separate.
>          */
>         if (unlikely(folio_test_large(folio) && max_nr != 1)) {
> -               nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
> +               nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, 0,
>                                      NULL, NULL, NULL);
>
>                 zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 1ff7b2174eb77..2a25eedc3b1c0 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -675,7 +675,6 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
>  static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>                         unsigned long end, struct mm_walk *walk)
>  {
> -       const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>         struct vm_area_struct *vma = walk->vma;
>         struct folio *folio;
>         struct queue_pages *qp = walk->private;
> @@ -713,8 +712,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>                         continue;
>                 if (folio_test_large(folio) && max_nr != 1)
>                         nr = folio_pte_batch(folio, addr, pte, ptent,
> -                                            max_nr, fpb_flags,
> -                                            NULL, NULL, NULL);
> +                                            max_nr, 0, NULL, NULL, NULL);
>                 /*
>                  * vm_normal_folio() filters out zero pages, but there might
>                  * still be reserved folios to skip, perhaps in a VDSO.
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 3cb72b579ffd3..2238cdc5eb1c1 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -307,14 +307,13 @@ void munlock_folio(struct folio *folio)
>  static inline unsigned int folio_mlock_step(struct folio *folio,
>                 pte_t *pte, unsigned long addr, unsigned long end)
>  {
> -       const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>         unsigned int count = (end - addr) >> PAGE_SHIFT;
>         pte_t ptent = ptep_get(pte);
>
>         if (!folio_test_large(folio))
>                 return 1;
>
> -       return folio_pte_batch(folio, addr, pte, ptent, count, fpb_flags, NULL,
> +       return folio_pte_batch(folio, addr, pte, ptent, count, 0, NULL,
>                                NULL, NULL);
>  }
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 36585041c760d..d4d3ffc931502 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -173,7 +173,6 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>  static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>                 pte_t *ptep, pte_t pte, int max_nr)
>  {
> -       const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>         struct folio *folio;
>
>         if (max_nr == 1)
> @@ -183,7 +182,7 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
>         if (!folio || !folio_test_large(folio))
>                 return 1;
>
> -       return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
> +       return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
>                                NULL, NULL);
>  }
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 3b74bb19c11dd..a29d7d29c7283 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1849,7 +1849,6 @@ void folio_remove_rmap_pud(struct folio *folio, struct page *page,
>  static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
>                         struct folio *folio, pte_t *ptep)
>  {
> -       const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>         int max_nr = folio_nr_pages(folio);
>         pte_t pte = ptep_get(ptep);
>
> @@ -1860,7 +1859,7 @@ static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
>         if (pte_pfn(pte) != folio_pfn(folio))
>                 return false;
>
> -       return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
> +       return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
>                                NULL, NULL) == max_nr;
>  }
>
> --
> 2.49.0
>
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements
  2025-06-27 11:55 ` [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements David Hildenbrand
@ 2025-06-27 13:58   ` Lance Yang
  2025-06-27 16:51   ` Lorenzo Stoakes
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 58+ messages in thread
From: Lance Yang @ 2025-06-27 13:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 7:55 PM David Hildenbrand <david@redhat.com> wrote:
>
> Let's clean up a bit:
>
> (1) No need for start_ptep vs. ptep anymore, we can simply use ptep
>
> (2) Let's switch to "unsigned int" for everything
>
> (3) We can simplify the code by leaving the pte unchanged after the
>     pte_same() check.
>
> (4) Clarify that we should never exceed a single VMA; it indicates a
>     problem in the caller.
>
> No functional change intended.

Nice cleanup! Simplifying the loop and removing the temporary variables
makes the code much easier to follow.

Also, clarifying the caller's responsibility to stay within a single VMA
and page table is a nice change ;)

Feel free to add:
Reviewed-by: Lance Yang <lance.yang@linux.dev>

Thanks,
Lance

>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/internal.h | 37 +++++++++++++++----------------------
>  1 file changed, 15 insertions(+), 22 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 9690c75063881..ca6590c6d9eab 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -221,7 +221,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   * folio_pte_batch - detect a PTE batch for a large folio
>   * @folio: The large folio to detect a PTE batch for.
>   * @addr: The user virtual address the first page is mapped at.
> - * @start_ptep: Page table pointer for the first entry.
> + * @ptep: Page table pointer for the first entry.
>   * @pte: Page table entry for the first page.
>   * @max_nr: The maximum number of table entries to consider.
>   * @flags: Flags to modify the PTE batch semantics.
> @@ -233,24 +233,24 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   *               first one is dirty.
>   *
>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> - * pages of the same large folio.
> + * pages of the same large folio in a single VMA and a single page table.
>   *
>   * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
>   * the accessed bit, writable bit, dirty bit (unless FPB_HONOR_DIRTY is set) and
>   * soft-dirty bit (unless FPB_HONOR_SOFT_DIRTY is set).
>   *
> - * start_ptep must map any page of the folio. max_nr must be at least one and
> - * must be limited by the caller so scanning cannot exceed a single page table.
> + * @ptep must map any page of the folio. max_nr must be at least one and
> + * must be limited by the caller so scanning cannot exceed a single VMA and
> + * a single page table.
>   *
>   * Return: the number of table entries in the batch.
>   */
> -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> -               pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> +static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
> +               pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
>                 bool *any_writable, bool *any_young, bool *any_dirty)
>  {
> -       pte_t expected_pte, *ptep;
> -       bool writable, young, dirty;
> -       int nr, cur_nr;
> +       unsigned int nr, cur_nr;
> +       pte_t expected_pte;
>
>         if (any_writable)
>                 *any_writable = false;
> @@ -267,29 +267,22 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>         max_nr = min_t(unsigned long, max_nr,
>                        folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
>
> -       nr = pte_batch_hint(start_ptep, pte);
> +       nr = pte_batch_hint(ptep, pte);
>         expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
> -       ptep = start_ptep + nr;
> +       ptep = ptep + nr;
>
>         while (nr < max_nr) {
>                 pte = ptep_get(ptep);
> -               if (any_writable)
> -                       writable = !!pte_write(pte);
> -               if (any_young)
> -                       young = !!pte_young(pte);
> -               if (any_dirty)
> -                       dirty = !!pte_dirty(pte);
> -               pte = __pte_batch_clear_ignored(pte, flags);
>
> -               if (!pte_same(pte, expected_pte))
> +               if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte))
>                         break;
>
>                 if (any_writable)
> -                       *any_writable |= writable;
> +                       *any_writable |= pte_write(pte);
>                 if (any_young)
> -                       *any_young |= young;
> +                       *any_young |= pte_young(pte);
>                 if (any_dirty)
> -                       *any_dirty |= dirty;
> +                       *any_dirty |= pte_dirty(pte);
>
>                 cur_nr = pte_batch_hint(ptep, pte);
>                 expected_pte = pte_advance_pfn(expected_pte, cur_nr);
> --
> 2.49.0
>
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
  2025-06-27 11:55 ` [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext() David Hildenbrand
@ 2025-06-27 14:19   ` Lance Yang
  2025-06-27 15:09     ` David Hildenbrand
  2025-06-27 18:48   ` Lorenzo Stoakes
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 58+ messages in thread
From: Lance Yang @ 2025-06-27 14:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 7:55 PM David Hildenbrand <david@redhat.com> wrote:
>
> Many users (including upcoming ones) don't really need the flags etc,
> and can live with a function call.
>
> So let's provide a basic, non-inlined folio_pte_batch().
>
> In zap_present_ptes(), where we care about performance, the compiler
> already seems to generate a call to a common inlined folio_pte_batch()
> variant, shared with fork() code. So calling the new non-inlined variant
> should not make a difference.

It's always an interesting dance with the compiler when it comes to inlining,
isn't it? We want the speed of 'inline' for critical paths, but also a compact
binary for the common case ...

This split is a nice solution to the classic 'inline' vs. code size dilemma ;p

Thanks,
Lance

>
> While at it, drop the "addr" parameter that is unused.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/internal.h  | 11 ++++++++---
>  mm/madvise.c   |  4 ++--
>  mm/memory.c    |  6 ++----
>  mm/mempolicy.c |  3 +--
>  mm/mlock.c     |  3 +--
>  mm/mremap.c    |  3 +--
>  mm/rmap.c      |  3 +--
>  mm/util.c      | 29 +++++++++++++++++++++++++++++
>  8 files changed, 45 insertions(+), 17 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index ca6590c6d9eab..6000b683f68ee 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -218,9 +218,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  }
>
>  /**
> - * folio_pte_batch - detect a PTE batch for a large folio
> + * folio_pte_batch_ext - detect a PTE batch for a large folio
>   * @folio: The large folio to detect a PTE batch for.
> - * @addr: The user virtual address the first page is mapped at.
>   * @ptep: Page table pointer for the first entry.
>   * @pte: Page table entry for the first page.
>   * @max_nr: The maximum number of table entries to consider.
> @@ -243,9 +242,12 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   * must be limited by the caller so scanning cannot exceed a single VMA and
>   * a single page table.
>   *
> + * This function will be inlined to optimize based on the input parameters;
> + * consider using folio_pte_batch() instead if applicable.
> + *
>   * Return: the number of table entries in the batch.
>   */
> -static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
> +static inline unsigned int folio_pte_batch_ext(struct folio *folio,
>                 pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
>                 bool *any_writable, bool *any_young, bool *any_dirty)
>  {
> @@ -293,6 +295,9 @@ static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long ad
>         return min(nr, max_nr);
>  }
>
> +unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
> +               unsigned int max_nr);
> +
>  /**
>   * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>   *      forward or backward by delta
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 661bb743d2216..9b9c35a398ed0 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -349,8 +349,8 @@ static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
>  {
>         int max_nr = (end - addr) / PAGE_SIZE;
>
> -       return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
> -                              any_young, any_dirty);
> +       return folio_pte_batch_ext(folio, ptep, pte, max_nr, 0, NULL,
> +                                  any_young, any_dirty);
>  }
>
>  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> diff --git a/mm/memory.c b/mm/memory.c
> index ab2d6c1425691..43d35d6675f2e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -995,7 +995,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>                 if (vma_soft_dirty_enabled(src_vma))
>                         flags |= FPB_HONOR_SOFT_DIRTY;
>
> -               nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
> +               nr = folio_pte_batch_ext(folio, src_pte, pte, max_nr, flags,
>                                      &any_writable, NULL, NULL);
>                 folio_ref_add(folio, nr);
>                 if (folio_test_anon(folio)) {
> @@ -1564,9 +1564,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
>          * by keeping the batching logic separate.
>          */
>         if (unlikely(folio_test_large(folio) && max_nr != 1)) {
> -               nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, 0,
> -                                    NULL, NULL, NULL);
> -
> +               nr = folio_pte_batch(folio, pte, ptent, max_nr);
>                 zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
>                                        addr, details, rss, force_flush,
>                                        force_break, any_skipped);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 2a25eedc3b1c0..eb83cff7db8c3 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -711,8 +711,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>                 if (!folio || folio_is_zone_device(folio))
>                         continue;
>                 if (folio_test_large(folio) && max_nr != 1)
> -                       nr = folio_pte_batch(folio, addr, pte, ptent,
> -                                            max_nr, 0, NULL, NULL, NULL);
> +                       nr = folio_pte_batch(folio, pte, ptent, max_nr);
>                 /*
>                  * vm_normal_folio() filters out zero pages, but there might
>                  * still be reserved folios to skip, perhaps in a VDSO.
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 2238cdc5eb1c1..a1d93ad33c6db 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -313,8 +313,7 @@ static inline unsigned int folio_mlock_step(struct folio *folio,
>         if (!folio_test_large(folio))
>                 return 1;
>
> -       return folio_pte_batch(folio, addr, pte, ptent, count, 0, NULL,
> -                              NULL, NULL);
> +       return folio_pte_batch(folio, pte, ptent, count);
>  }
>
>  static inline bool allow_mlock_munlock(struct folio *folio,
> diff --git a/mm/mremap.c b/mm/mremap.c
> index d4d3ffc931502..1f5bebbb9c0cb 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -182,8 +182,7 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
>         if (!folio || !folio_test_large(folio))
>                 return 1;
>
> -       return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
> -                              NULL, NULL);
> +       return folio_pte_batch(folio, ptep, pte, max_nr);
>  }
>
>  static int move_ptes(struct pagetable_move_control *pmc,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a29d7d29c7283..6658968600b72 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1859,8 +1859,7 @@ static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
>         if (pte_pfn(pte) != folio_pfn(folio))
>                 return false;
>
> -       return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
> -                              NULL, NULL) == max_nr;
> +       return folio_pte_batch(folio, ptep, pte, max_nr) == max_nr;
>  }
>
>  /*
> diff --git a/mm/util.c b/mm/util.c
> index 0b270c43d7d12..d29dcc135ad28 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1171,3 +1171,32 @@ int compat_vma_mmap_prepare(struct file *file, struct vm_area_struct *vma)
>         return 0;
>  }
>  EXPORT_SYMBOL(compat_vma_mmap_prepare);
> +
> +#ifdef CONFIG_MMU
> +/**
> + * folio_pte_batch - detect a PTE batch for a large folio
> + * @folio: The large folio to detect a PTE batch for.
> + * @ptep: Page table pointer for the first entry.
> + * @pte: Page table entry for the first page.
> + * @max_nr: The maximum number of table entries to consider.
> + *
> + * This is a simplified variant of folio_pte_batch_ext().
> + *
> + * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> + * pages of the same large folio in a single VMA and a single page table.
> + *
> + * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
> + * the accessed bit, writable bit, dirty bit and soft-dirty bit.
> + *
> + * ptep must map any page of the folio. max_nr must be at least one and
> + * must be limited by the caller so scanning cannot exceed a single VMA and
> + * a single page table.
> + *
> + * Return: the number of table entries in the batch.
> + */
> +unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
> +               unsigned int max_nr)
> +{
> +       return folio_pte_batch_ext(folio, ptep, pte, max_nr, 0, NULL, NULL, NULL);
> +}
> +#endif /* CONFIG_MMU */
> --
> 2.49.0
>
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-06-27 11:55 ` [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext() David Hildenbrand
@ 2025-06-27 14:34   ` Lance Yang
  2025-06-27 15:11     ` David Hildenbrand
  2025-06-27 19:04   ` Lorenzo Stoakes
  2025-06-30 17:59   ` Zi Yan
  2 siblings, 1 reply; 58+ messages in thread
From: Lance Yang @ 2025-06-27 14:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 7:55 PM David Hildenbrand <david@redhat.com> wrote:
>
> Instead, let's just allow for specifying through flags whether we want
> to have bits merged into the original PTE.
>
> For the madvise() case, simplify by having only a single parameter for
> merging young+dirty. For madvise_cold_or_pageout_pte_range() merging the
> dirty bit is not required, but also not harmful. This code is not that
> performance critical after all to really force all micro-optimizations.

IIRC, this is work you've wanted to do for a long time - maybe even a year ago?

Less conditional logic is always a good thing ;)

Thanks,
Lance

>
> As we now have two pte_t * parameters, use PageTable() to make sure we
> are actually given a pointer at a copy of the PTE, not a pointer into
> an actual page table.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/internal.h | 58 +++++++++++++++++++++++++++++++--------------------
>  mm/madvise.c  | 26 +++++------------------
>  mm/memory.c   |  8 ++-----
>  mm/util.c     |  2 +-
>  4 files changed, 43 insertions(+), 51 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 6000b683f68ee..fe69e21b34a24 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -208,6 +208,18 @@ typedef int __bitwise fpb_t;
>  /* Compare PTEs honoring the soft-dirty bit. */
>  #define FPB_HONOR_SOFT_DIRTY           ((__force fpb_t)BIT(1))
>
> +/*
> + * Merge PTE write bits: if any PTE in the batch is writable, modify the
> + * PTE at @ptentp to be writable.
> + */
> +#define FPB_MERGE_WRITE                        ((__force fpb_t)BIT(2))
> +
> +/*
> + * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty,
> + * modify the PTE at @ptentp to be young or dirty, respectively.
> + */
> +#define FPB_MERGE_YOUNG_DIRTY          ((__force fpb_t)BIT(3))
> +
>  static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  {
>         if (!(flags & FPB_HONOR_DIRTY))
> @@ -220,16 +232,11 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  /**
>   * folio_pte_batch_ext - detect a PTE batch for a large folio
>   * @folio: The large folio to detect a PTE batch for.
> + * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
>   * @ptep: Page table pointer for the first entry.
> - * @pte: Page table entry for the first page.
> + * @ptentp: Pointer at a copy of the first page table entry.
>   * @max_nr: The maximum number of table entries to consider.
>   * @flags: Flags to modify the PTE batch semantics.
> - * @any_writable: Optional pointer to indicate whether any entry except the
> - *               first one is writable.
> - * @any_young: Optional pointer to indicate whether any entry except the
> - *               first one is young.
> - * @any_dirty: Optional pointer to indicate whether any entry except the
> - *               first one is dirty.
>   *
>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>   * pages of the same large folio in a single VMA and a single page table.
> @@ -242,28 +249,26 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   * must be limited by the caller so scanning cannot exceed a single VMA and
>   * a single page table.
>   *
> + * Depending on the FPB_MERGE_* flags, the pte stored at @ptentp will
> + * be modified.
> + *
>   * This function will be inlined to optimize based on the input parameters;
>   * consider using folio_pte_batch() instead if applicable.
>   *
>   * Return: the number of table entries in the batch.
>   */
>  static inline unsigned int folio_pte_batch_ext(struct folio *folio,
> -               pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
> -               bool *any_writable, bool *any_young, bool *any_dirty)
> +               struct vm_area_struct *vma, pte_t *ptep, pte_t *ptentp,
> +               unsigned int max_nr, fpb_t flags)
>  {
> +       bool any_writable = false, any_young = false, any_dirty = false;
> +       pte_t expected_pte, pte = *ptentp;
>         unsigned int nr, cur_nr;
> -       pte_t expected_pte;
> -
> -       if (any_writable)
> -               *any_writable = false;
> -       if (any_young)
> -               *any_young = false;
> -       if (any_dirty)
> -               *any_dirty = false;
>
>         VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>         VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
>         VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
> +       VM_WARN_ON(virt_addr_valid(ptentp) && PageTable(virt_to_page(ptentp)));
>
>         /* Limit max_nr to the actual remaining PFNs in the folio we could batch. */
>         max_nr = min_t(unsigned long, max_nr,
> @@ -279,12 +284,12 @@ static inline unsigned int folio_pte_batch_ext(struct folio *folio,
>                 if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte))
>                         break;
>
> -               if (any_writable)
> -                       *any_writable |= pte_write(pte);
> -               if (any_young)
> -                       *any_young |= pte_young(pte);
> -               if (any_dirty)
> -                       *any_dirty |= pte_dirty(pte);
> +               if (flags & FPB_MERGE_WRITE)
> +                       any_writable |= pte_write(pte);
> +               if (flags & FPB_MERGE_YOUNG_DIRTY) {
> +                       any_young |= pte_young(pte);
> +                       any_dirty |= pte_dirty(pte);
> +               }
>
>                 cur_nr = pte_batch_hint(ptep, pte);
>                 expected_pte = pte_advance_pfn(expected_pte, cur_nr);
> @@ -292,6 +297,13 @@ static inline unsigned int folio_pte_batch_ext(struct folio *folio,
>                 nr += cur_nr;
>         }
>
> +       if (any_writable)
> +               *ptentp = pte_mkwrite(*ptentp, vma);
> +       if (any_young)
> +               *ptentp = pte_mkyoung(*ptentp);
> +       if (any_dirty)
> +               *ptentp = pte_mkdirty(*ptentp);
> +
>         return min(nr, max_nr);
>  }
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 9b9c35a398ed0..dce8f5e8555cb 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -344,13 +344,12 @@ static inline bool can_do_file_pageout(struct vm_area_struct *vma)
>
>  static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
>                                           struct folio *folio, pte_t *ptep,
> -                                         pte_t pte, bool *any_young,
> -                                         bool *any_dirty)
> +                                         pte_t *ptentp)
>  {
>         int max_nr = (end - addr) / PAGE_SIZE;
>
> -       return folio_pte_batch_ext(folio, ptep, pte, max_nr, 0, NULL,
> -                                  any_young, any_dirty);
> +       return folio_pte_batch_ext(folio, NULL, ptep, ptentp, max_nr,
> +                                  FPB_MERGE_YOUNG_DIRTY);
>  }
>
>  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> @@ -488,13 +487,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>                  * next pte in the range.
>                  */
>                 if (folio_test_large(folio)) {
> -                       bool any_young;
> -
> -                       nr = madvise_folio_pte_batch(addr, end, folio, pte,
> -                                                    ptent, &any_young, NULL);
> -                       if (any_young)
> -                               ptent = pte_mkyoung(ptent);
> -
> +                       nr = madvise_folio_pte_batch(addr, end, folio, pte, &ptent);
>                         if (nr < folio_nr_pages(folio)) {
>                                 int err;
>
> @@ -724,11 +717,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>                  * next pte in the range.
>                  */
>                 if (folio_test_large(folio)) {
> -                       bool any_young, any_dirty;
> -
> -                       nr = madvise_folio_pte_batch(addr, end, folio, pte,
> -                                                    ptent, &any_young, &any_dirty);
> -
> +                       nr = madvise_folio_pte_batch(addr, end, folio, pte, &ptent);
>                         if (nr < folio_nr_pages(folio)) {
>                                 int err;
>
> @@ -753,11 +742,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>                                         nr = 0;
>                                 continue;
>                         }
> -
> -                       if (any_young)
> -                               ptent = pte_mkyoung(ptent);
> -                       if (any_dirty)
> -                               ptent = pte_mkdirty(ptent);
>                 }
>
>                 if (folio_test_swapcache(folio) || folio_test_dirty(folio)) {
> diff --git a/mm/memory.c b/mm/memory.c
> index 43d35d6675f2e..985d09bee44fd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -972,10 +972,9 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>                  pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
>                  int max_nr, int *rss, struct folio **prealloc)
>  {
> +       fpb_t flags = FPB_MERGE_WRITE;
>         struct page *page;
>         struct folio *folio;
> -       bool any_writable;
> -       fpb_t flags = 0;
>         int err, nr;
>
>         page = vm_normal_page(src_vma, addr, pte);
> @@ -995,8 +994,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>                 if (vma_soft_dirty_enabled(src_vma))
>                         flags |= FPB_HONOR_SOFT_DIRTY;
>
> -               nr = folio_pte_batch_ext(folio, src_pte, pte, max_nr, flags,
> -                                    &any_writable, NULL, NULL);
> +               nr = folio_pte_batch_ext(folio, src_vma, src_pte, &pte, max_nr, flags);
>                 folio_ref_add(folio, nr);
>                 if (folio_test_anon(folio)) {
>                         if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
> @@ -1010,8 +1008,6 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>                         folio_dup_file_rmap_ptes(folio, page, nr, dst_vma);
>                         rss[mm_counter_file(folio)] += nr;
>                 }
> -               if (any_writable)
> -                       pte = pte_mkwrite(pte, src_vma);
>                 __copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte,
>                                     addr, nr);
>                 return nr;
> diff --git a/mm/util.c b/mm/util.c
> index d29dcc135ad28..19d1a5814fac7 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1197,6 +1197,6 @@ EXPORT_SYMBOL(compat_vma_mmap_prepare);
>  unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
>                 unsigned int max_nr)
>  {
> -       return folio_pte_batch_ext(folio, ptep, pte, max_nr, 0, NULL, NULL, NULL);
> +       return folio_pte_batch_ext(folio, NULL, ptep, &pte, max_nr, 0);
>  }
>  #endif /* CONFIG_MMU */
> --
> 2.49.0
>
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
  2025-06-27 14:19   ` Lance Yang
@ 2025-06-27 15:09     ` David Hildenbrand
  2025-06-27 15:45       ` Lance Yang
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-06-27 15:09 UTC (permalink / raw)
  To: Lance Yang
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On 27.06.25 16:19, Lance Yang wrote:
> On Fri, Jun 27, 2025 at 7:55 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> Many users (including upcoming ones) don't really need the flags etc,
>> and can live with a function call.
>>
>> So let's provide a basic, non-inlined folio_pte_batch().
>>
>> In zap_present_ptes(), where we care about performance, the compiler
>> already seem to generate a call to a common inlined folio_pte_batch()
>> variant, shared with fork() code. So calling the new non-inlined variant
>> should not make a difference.
> 
> It's always an interesting dance with the compiler when it comes to inlining,
> isn't it? We want the speed of 'inline' for critical paths, but also a compact
> binary for the common case ...
> 
> This split is a nice solution to the classic 'inline' vs. code size dilemma ;p

Yeah, in particular when we primarily care about optimizing out all the 
unnecessary checks inside the function, not necessarily also inlining 
the function call itself.

If we ever realize we absolutely must inline it into a caller, we could 
turn folio_pte_batch_ext() into an "__always_inline", but for now it 
does not seem like this is really required from my experiments.
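
As a rough userspace sketch of what I mean (invented names, nothing from
mm/ here): when the helper stays inline and the call site passes a
compile-time-constant flags value, the compiler can simply drop the
untaken branches, while the plain no-flags wrapper is fine out of line:

#include <stdio.h>

#define MERGE_YOUNG	0x1
#define MERGE_WRITE	0x2

static inline int batch_ext(const int *ptes, int n, int flags, int *merged)
{
	int nr = 0;

	for (int i = 0; i < n; i++) {
		nr++;
		if (flags & MERGE_YOUNG)	/* dropped when flags is constant 0 */
			*merged |= ptes[i] & MERGE_YOUNG;
		if (flags & MERGE_WRITE)	/* dropped when flags is constant 0 */
			*merged |= ptes[i] & MERGE_WRITE;
	}
	return nr;
}

/* Out-of-line wrapper for the common "no flags" case. */
int batch(const int *ptes, int n)
{
	int unused = 0;

	return batch_ext(ptes, n, 0, &unused);
}

int main(void)
{
	int ptes[] = { MERGE_YOUNG, 0, MERGE_WRITE, MERGE_YOUNG | MERGE_WRITE };
	int merged = 0;
	int nr = batch_ext(ptes, 4, MERGE_YOUNG | MERGE_WRITE, &merged);

	printf("plain: nr=%d\n", batch(ptes, 4));
	printf("ext:   nr=%d merged=%#x\n", nr, (unsigned int)merged);
	return 0;
}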

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-06-27 14:34   ` Lance Yang
@ 2025-06-27 15:11     ` David Hildenbrand
  2025-06-27 15:40       ` Lance Yang
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-06-27 15:11 UTC (permalink / raw)
  To: Lance Yang
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On 27.06.25 16:34, Lance Yang wrote:
> On Fri, Jun 27, 2025 at 7:55 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> Instead, let's just allow for specifying through flags whether we want
>> to have bits merged into the original PTE.
>>
>> For the madvise() case, simplify by having only a single parameter for
>> merging young+dirty. For madvise_cold_or_pageout_pte_range() merging the
>> dirty bit is not required, but also not harmful. This code is not that
>> performance critical after all to really force all micro-optimizations.
> 
> IIRC, this work you've wanted to do for a long time - maybe even a year ago?

Heh, maybe, I don't remember.

For this user here, I realized that we might already check the existence 
of any_dirty at runtime -- because the compiler will not necessarily 
create two optimized functions.

So we already have runtime checks ... instead of checking whether 
any_dirty == NULL, we now simply do the merging (checking for 
pte_young() instead).
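
As a made-up userspace sketch (not the actual mm/ code), the two shapes
look like this -- out of line, both variants end up with a runtime check,
so merging into the pte copy via flags loses nothing:

#include <stdbool.h>
#include <stdio.h>

#define MERGE_YOUNG_DIRTY	0x1
#define YOUNG			0x1

/* Old shape: optional bool output, NULL-checked at runtime. */
static int scan_old(const int *ptes, int n, bool *any_young)
{
	for (int i = 0; i < n; i++)
		if (any_young)			/* runtime check */
			*any_young |= ptes[i] & YOUNG;
	return n;
}

/* New shape: a flags bit decides, and the result is merged into *ptentp. */
static int scan_new(const int *ptes, int n, int flags, int *ptentp)
{
	bool any_young = false;

	for (int i = 0; i < n; i++)
		if (flags & MERGE_YOUNG_DIRTY)	/* still a runtime check */
			any_young |= ptes[i] & YOUNG;
	if (any_young)
		*ptentp |= YOUNG;		/* pte_mkyoung() stand-in */
	return n;
}

int main(void)
{
	int ptes[] = { 0, YOUNG, 0 };
	bool young = false;
	int ptent = 0;

	scan_old(ptes, 3, &young);
	scan_new(ptes, 3, MERGE_YOUNG_DIRTY, &ptent);
	printf("old: any_young=%d, new: ptent=%#x\n", young, (unsigned int)ptent);
	return 0;
}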

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-06-27 15:11     ` David Hildenbrand
@ 2025-06-27 15:40       ` Lance Yang
  0 siblings, 0 replies; 58+ messages in thread
From: Lance Yang @ 2025-06-27 15:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang



On 2025/6/27 23:11, David Hildenbrand wrote:
> On 27.06.25 16:34, Lance Yang wrote:
>> On Fri, Jun 27, 2025 at 7:55 PM David Hildenbrand <david@redhat.com> 
>> wrote:
>>>
>>> Instead, let's just allow for specifying through flags whether we want
>>> to have bits merged into the original PTE.
>>>
>>> For the madvise() case, simplify by having only a single parameter for
>>> merging young+dirty. For madvise_cold_or_pageout_pte_range() merging the
>>> dirty bit is not required, but also not harmful. This code is not that
>>> performance critical after all to really force all micro-optimizations.
>>
>> IIRC, this work you've wanted to do for a long time - maybe even a 
>> year ago?
> 
> Heh, maybe, I don't remember.
> 
> For this user here, I realized that we might already check the existence 
> of any_dirty at runtime -- because the compiler will not necessarily 
> create two optimized functions.

Ah, I see! That's a very sharp observation about the compiler's behavior ;)

> 
> So we already have runtime checks ... instead of checking whether 
> any_dirty == NULL, we now simply do the merging (checking for 
> pte_young() instead) now.

Thanks for the lesson!
Lance


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
  2025-06-27 15:09     ` David Hildenbrand
@ 2025-06-27 15:45       ` Lance Yang
  0 siblings, 0 replies; 58+ messages in thread
From: Lance Yang @ 2025-06-27 15:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang



On 2025/6/27 23:09, David Hildenbrand wrote:
> On 27.06.25 16:19, Lance Yang wrote:
>> On Fri, Jun 27, 2025 at 7:55 PM David Hildenbrand <david@redhat.com> 
>> wrote:
>>>
>>> Many users (including upcoming ones) don't really need the flags etc,
>>> and can live with a function call.
>>>
>>> So let's provide a basic, non-inlined folio_pte_batch().
>>>
>>> In zap_present_ptes(), where we care about performance, the compiler
>>> already seem to generate a call to a common inlined folio_pte_batch()
>>> variant, shared with fork() code. So calling the new non-inlined variant
>>> should not make a difference.
>>
>> It's always an interesting dance with the compiler when it comes to 
>> inlining,
>> isn't it? We want the speed of 'inline' for critical paths, but also a 
>> compact
>> binary for the common case ...
>>
>> This split is a nice solution to the classic 'inline' vs. code size 
>> dilemma ;p
> 
> Yeah, in particular when we primarily care about optimizing out all the 
> unnecessary checks inside the function, not necessarily also inlining 
> the function call itself.
> 
> If we ever realize we absolutely must inline it into a caller, we could 
> turn folio_pte_batch_ext() into an "__always_inline", but for now it 
> does not seem like this is really required from my experiments.

Right, that makes sense. No need to force "__always_inline" prematurely.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-27 11:55 ` [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_* David Hildenbrand
  2025-06-27 13:40   ` Lance Yang
@ 2025-06-27 16:28   ` Lorenzo Stoakes
  2025-06-27 16:30     ` David Hildenbrand
  2025-06-28  3:37   ` Dev Jain
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 58+ messages in thread
From: Lorenzo Stoakes @ 2025-06-27 16:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 01:55:07PM +0200, David Hildenbrand wrote:
> Honoring these PTE bits is the exception, so let's invert the meaning.
>
> With this change, most callers don't have to pass any flags.
>
> No functional change intended.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>

This is a nice change, it removes a lot of code I really didn't enjoy
looking at for introducing these flags all over the place.

But a nit on the naming below, I'm not a fan of 'honor' here :)

> ---
>  mm/internal.h  | 16 ++++++++--------
>  mm/madvise.c   |  3 +--
>  mm/memory.c    | 11 +++++------
>  mm/mempolicy.c |  4 +---
>  mm/mlock.c     |  3 +--
>  mm/mremap.c    |  3 +--
>  mm/rmap.c      |  3 +--
>  7 files changed, 18 insertions(+), 25 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index e84217e27778d..9690c75063881 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -202,17 +202,17 @@ static inline void vma_close(struct vm_area_struct *vma)
>  /* Flags for folio_pte_batch(). */
>  typedef int __bitwise fpb_t;
>
> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> -#define FPB_IGNORE_DIRTY		((__force fpb_t)BIT(0))
> +/* Compare PTEs honoring the dirty bit. */
> +#define FPB_HONOR_DIRTY		((__force fpb_t)BIT(0))

Hm not to be petty but... :)

I'm not sure I find 'honor' very clear here. Ignore is very clear, 'honor' (God
the British English in me wants to say honour here but stop :P) doesn't
necessarily tell you what is going to happen.

Perhaps PROPAGATE? or OBEY?

>
> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> -#define FPB_IGNORE_SOFT_DIRTY		((__force fpb_t)BIT(1))
> +/* Compare PTEs honoring the soft-dirty bit. */
> +#define FPB_HONOR_SOFT_DIRTY		((__force fpb_t)BIT(1))
>
>  static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  {
> -	if (flags & FPB_IGNORE_DIRTY)
> +	if (!(flags & FPB_HONOR_DIRTY))
>  		pte = pte_mkclean(pte);
> -	if (likely(flags & FPB_IGNORE_SOFT_DIRTY))
> +	if (likely(!(flags & FPB_HONOR_SOFT_DIRTY)))
>  		pte = pte_clear_soft_dirty(pte);
>  	return pte_wrprotect(pte_mkold(pte));
>  }
> @@ -236,8 +236,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   * pages of the same large folio.
>   *
>   * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
> - * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
> - * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
> + * the accessed bit, writable bit, dirty bit (unless FPB_HONOR_DIRTY is set) and
> + * soft-dirty bit (unless FPB_HONOR_SOFT_DIRTY is set).
>   *
>   * start_ptep must map any page of the folio. max_nr must be at least one and
>   * must be limited by the caller so scanning cannot exceed a single page table.
> diff --git a/mm/madvise.c b/mm/madvise.c
> index e61e32b2cd91f..661bb743d2216 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -347,10 +347,9 @@ static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
>  					  pte_t pte, bool *any_young,
>  					  bool *any_dirty)
>  {
> -	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>  	int max_nr = (end - addr) / PAGE_SIZE;
>
> -	return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
>  			       any_young, any_dirty);
>  }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f9b32a20e5b7..ab2d6c1425691 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -990,10 +990,10 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>  	 * by keeping the batching logic separate.
>  	 */
>  	if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
> -		if (src_vma->vm_flags & VM_SHARED)
> -			flags |= FPB_IGNORE_DIRTY;
> -		if (!vma_soft_dirty_enabled(src_vma))
> -			flags |= FPB_IGNORE_SOFT_DIRTY;
> +		if (!(src_vma->vm_flags & VM_SHARED))
> +			flags |= FPB_HONOR_DIRTY;
> +		if (vma_soft_dirty_enabled(src_vma))
> +			flags |= FPB_HONOR_SOFT_DIRTY;
>
>  		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
>  				     &any_writable, NULL, NULL);
> @@ -1535,7 +1535,6 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
>  		struct zap_details *details, int *rss, bool *force_flush,
>  		bool *force_break, bool *any_skipped)
>  {
> -	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>  	struct mm_struct *mm = tlb->mm;
>  	struct folio *folio;
>  	struct page *page;
> @@ -1565,7 +1564,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
>  	 * by keeping the batching logic separate.
>  	 */
>  	if (unlikely(folio_test_large(folio) && max_nr != 1)) {
> -		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
> +		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, 0,
>  				     NULL, NULL, NULL);
>
>  		zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 1ff7b2174eb77..2a25eedc3b1c0 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -675,7 +675,6 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
>  static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>  			unsigned long end, struct mm_walk *walk)
>  {
> -	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>  	struct vm_area_struct *vma = walk->vma;
>  	struct folio *folio;
>  	struct queue_pages *qp = walk->private;
> @@ -713,8 +712,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>  			continue;
>  		if (folio_test_large(folio) && max_nr != 1)
>  			nr = folio_pte_batch(folio, addr, pte, ptent,
> -					     max_nr, fpb_flags,
> -					     NULL, NULL, NULL);
> +					     max_nr, 0, NULL, NULL, NULL);
>  		/*
>  		 * vm_normal_folio() filters out zero pages, but there might
>  		 * still be reserved folios to skip, perhaps in a VDSO.
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 3cb72b579ffd3..2238cdc5eb1c1 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -307,14 +307,13 @@ void munlock_folio(struct folio *folio)
>  static inline unsigned int folio_mlock_step(struct folio *folio,
>  		pte_t *pte, unsigned long addr, unsigned long end)
>  {
> -	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>  	unsigned int count = (end - addr) >> PAGE_SHIFT;
>  	pte_t ptent = ptep_get(pte);
>
>  	if (!folio_test_large(folio))
>  		return 1;
>
> -	return folio_pte_batch(folio, addr, pte, ptent, count, fpb_flags, NULL,
> +	return folio_pte_batch(folio, addr, pte, ptent, count, 0, NULL,
>  			       NULL, NULL);
>  }
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 36585041c760d..d4d3ffc931502 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -173,7 +173,6 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>  static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>  		pte_t *ptep, pte_t pte, int max_nr)
>  {
> -	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>  	struct folio *folio;
>
>  	if (max_nr == 1)
> @@ -183,7 +182,7 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
>  	if (!folio || !folio_test_large(folio))
>  		return 1;
>
> -	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
>  			       NULL, NULL);
>  }
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 3b74bb19c11dd..a29d7d29c7283 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1849,7 +1849,6 @@ void folio_remove_rmap_pud(struct folio *folio, struct page *page,
>  static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
>  			struct folio *folio, pte_t *ptep)
>  {
> -	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>  	int max_nr = folio_nr_pages(folio);
>  	pte_t pte = ptep_get(ptep);
>
> @@ -1860,7 +1859,7 @@ static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
>  	if (pte_pfn(pte) != folio_pfn(folio))
>  		return false;
>
> -	return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
>  			       NULL, NULL) == max_nr;

I hope a later patch gets rid of this NULL, NULL, NULL... :)

>  }
>
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-27 16:28   ` Lorenzo Stoakes
@ 2025-06-27 16:30     ` David Hildenbrand
  2025-06-27 16:33       ` Lorenzo Stoakes
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-06-27 16:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On 27.06.25 18:28, Lorenzo Stoakes wrote:
> On Fri, Jun 27, 2025 at 01:55:07PM +0200, David Hildenbrand wrote:
>> Honoring these PTE bits is the exception, so let's invert the meaning.
>>
>> With this change, most callers don't have to pass any flags.
>>
>> No functional change intended.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> This is a nice change, it removes a lot of code I really didn't enjoy
> looking at for introducing these flags all over the place.
> 
> But a nit on the naming below, I'm not a fan of 'honor' here :)
> 
>> ---
>>   mm/internal.h  | 16 ++++++++--------
>>   mm/madvise.c   |  3 +--
>>   mm/memory.c    | 11 +++++------
>>   mm/mempolicy.c |  4 +---
>>   mm/mlock.c     |  3 +--
>>   mm/mremap.c    |  3 +--
>>   mm/rmap.c      |  3 +--
>>   7 files changed, 18 insertions(+), 25 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index e84217e27778d..9690c75063881 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -202,17 +202,17 @@ static inline void vma_close(struct vm_area_struct *vma)
>>   /* Flags for folio_pte_batch(). */
>>   typedef int __bitwise fpb_t;
>>
>> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>> -#define FPB_IGNORE_DIRTY		((__force fpb_t)BIT(0))
>> +/* Compare PTEs honoring the dirty bit. */
>> +#define FPB_HONOR_DIRTY		((__force fpb_t)BIT(0))
> 
> Hm not to be petty but... :)
> 
> I'm not sure I find 'honor' very clear here. Ignore is very clear, 'honor' (God
> the British English in me wants to say honour here but stipp :P) doesn't
> necessarily tell you what is going to happen.
> 
> Perhaps PROPAGATE? or OBEY?

RESPECT? :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-27 16:30     ` David Hildenbrand
@ 2025-06-27 16:33       ` Lorenzo Stoakes
  2025-06-29  8:59         ` Mike Rapoport
  0 siblings, 1 reply; 58+ messages in thread
From: Lorenzo Stoakes @ 2025-06-27 16:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 06:30:13PM +0200, David Hildenbrand wrote:
> On 27.06.25 18:28, Lorenzo Stoakes wrote:
> > On Fri, Jun 27, 2025 at 01:55:07PM +0200, David Hildenbrand wrote:
> > > Honoring these PTE bits is the exception, so let's invert the meaning.
> > >
> > > With this change, most callers don't have to pass any flags.
> > >
> > > No functional change intended.
> > >
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> >
> > This is a nice change, it removes a lot of code I really didn't enjoy
> > looking at for introducing these flags all over the place.
> >
> > But a nit on the naming below, I'm not a fan of 'honor' here :)
> >
> > > ---
> > >   mm/internal.h  | 16 ++++++++--------
> > >   mm/madvise.c   |  3 +--
> > >   mm/memory.c    | 11 +++++------
> > >   mm/mempolicy.c |  4 +---
> > >   mm/mlock.c     |  3 +--
> > >   mm/mremap.c    |  3 +--
> > >   mm/rmap.c      |  3 +--
> > >   7 files changed, 18 insertions(+), 25 deletions(-)
> > >
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index e84217e27778d..9690c75063881 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -202,17 +202,17 @@ static inline void vma_close(struct vm_area_struct *vma)
> > >   /* Flags for folio_pte_batch(). */
> > >   typedef int __bitwise fpb_t;
> > >
> > > -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> > > -#define FPB_IGNORE_DIRTY		((__force fpb_t)BIT(0))
> > > +/* Compare PTEs honoring the dirty bit. */
> > > +#define FPB_HONOR_DIRTY		((__force fpb_t)BIT(0))
> >
> > Hm not to be petty but... :)
> >
> > I'm not sure I find 'honor' very clear here. Ignore is very clear, 'honor' (God
> > the British English in me wants to say honour here but stop :P) doesn't
> > necessarily tell you what is going to happen.
> >
> > Perhaps PROPAGATE? or OBEY?
>
> RESPECT? :)

🎵 R-E-S-P-E-C-T find out what it means to me... ;) 🎵

This works too :>)

>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements
  2025-06-27 11:55 ` [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements David Hildenbrand
  2025-06-27 13:58   ` Lance Yang
@ 2025-06-27 16:51   ` Lorenzo Stoakes
  2025-06-27 17:02     ` David Hildenbrand
  2025-06-30 17:40   ` Zi Yan
  2025-07-02  8:42   ` Oscar Salvador
  3 siblings, 1 reply; 58+ messages in thread
From: Lorenzo Stoakes @ 2025-06-27 16:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 01:55:08PM +0200, David Hildenbrand wrote:
> Let's clean up a bit:
>
> (1) No need for start_ptep vs. ptep anymore, we can simply use ptep
>
> (2) Let's switch to "unsigned int" for everything
>
> (3) We can simplify the code by leaving the pte unchanged after the
>     pte_same() check.
>
> (4) Clarify that we should never exceed a single VMA; it indicates a
>     problem in the caller.
>
> No functional change intended.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/internal.h | 37 +++++++++++++++----------------------
>  1 file changed, 15 insertions(+), 22 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 9690c75063881..ca6590c6d9eab 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -221,7 +221,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   * folio_pte_batch - detect a PTE batch for a large folio
>   * @folio: The large folio to detect a PTE batch for.
>   * @addr: The user virtual address the first page is mapped at.
> - * @start_ptep: Page table pointer for the first entry.
> + * @ptep: Page table pointer for the first entry.
>   * @pte: Page table entry for the first page.
>   * @max_nr: The maximum number of table entries to consider.
>   * @flags: Flags to modify the PTE batch semantics.
> @@ -233,24 +233,24 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   *		  first one is dirty.
>   *
>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> - * pages of the same large folio.
> + * pages of the same large folio in a single VMA and a single page table.
>   *
>   * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
>   * the accessed bit, writable bit, dirty bit (unless FPB_HONOR_DIRTY is set) and
>   * soft-dirty bit (unless FPB_HONOR_SOFT_DIRTY is set).
>   *
> - * start_ptep must map any page of the folio. max_nr must be at least one and
> - * must be limited by the caller so scanning cannot exceed a single page table.
> + * @ptep must map any page of the folio. max_nr must be at least one and
> + * must be limited by the caller so scanning cannot exceed a single VMA and
> + * a single page table.
>   *
>   * Return: the number of table entries in the batch.
>   */
> -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> -		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> +static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,

Do we need to worry about propagating this type change?

mremap_folio_pte_batch() and zap_present_ptes() return the value as an int for example.

I mean I doubt we're going to be seeing an overflow here :) but maybe worth
propagating this everywhere.

> +		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
>  		bool *any_writable, bool *any_young, bool *any_dirty)
>  {
> -	pte_t expected_pte, *ptep;
> -	bool writable, young, dirty;
> -	int nr, cur_nr;
> +	unsigned int nr, cur_nr;
> +	pte_t expected_pte;
>
>  	if (any_writable)
>  		*any_writable = false;
> @@ -267,29 +267,22 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>  	max_nr = min_t(unsigned long, max_nr,
>  		       folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
>
> -	nr = pte_batch_hint(start_ptep, pte);
> +	nr = pte_batch_hint(ptep, pte);
>  	expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
> -	ptep = start_ptep + nr;
> +	ptep = ptep + nr;
>
>  	while (nr < max_nr) {
>  		pte = ptep_get(ptep);
> -		if (any_writable)
> -			writable = !!pte_write(pte);
> -		if (any_young)
> -			young = !!pte_young(pte);
> -		if (any_dirty)
> -			dirty = !!pte_dirty(pte);
> -		pte = __pte_batch_clear_ignored(pte, flags);
>
> -		if (!pte_same(pte, expected_pte))
> +		if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte))

Doing this here will change the output of any_writable, any_young:

static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
{
	...
	return pte_wrprotect(pte_mkold(pte));
}

So we probably need to get these values earlier?

>  			break;
>
>  		if (any_writable)
> -			*any_writable |= writable;
> +			*any_writable |= pte_write(pte);
>  		if (any_young)
> -			*any_young |= young;
> +			*any_young |= pte_young(pte);
>  		if (any_dirty)
> -			*any_dirty |= dirty;
> +			*any_dirty |= pte_dirty(pte);
>
>  		cur_nr = pte_batch_hint(ptep, pte);
>  		expected_pte = pte_advance_pfn(expected_pte, cur_nr);
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements
  2025-06-27 16:51   ` Lorenzo Stoakes
@ 2025-06-27 17:02     ` David Hildenbrand
  2025-06-27 18:39       ` Lorenzo Stoakes
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-06-27 17:02 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On 27.06.25 18:51, Lorenzo Stoakes wrote:
> On Fri, Jun 27, 2025 at 01:55:08PM +0200, David Hildenbrand wrote:
>> Let's clean up a bit:
>>
>> (1) No need for start_ptep vs. ptep anymore, we can simply use ptep
>>
>> (2) Let's switch to "unsigned int" for everything
>>
>> (3) We can simplify the code by leaving the pte unchanged after the
>>      pte_same() check.
>>
>> (4) Clarify that we should never exceed a single VMA; it indicates a
>>      problem in the caller.
>>
>> No functional change intended.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/internal.h | 37 +++++++++++++++----------------------
>>   1 file changed, 15 insertions(+), 22 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 9690c75063881..ca6590c6d9eab 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -221,7 +221,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>    * folio_pte_batch - detect a PTE batch for a large folio
>>    * @folio: The large folio to detect a PTE batch for.
>>    * @addr: The user virtual address the first page is mapped at.
>> - * @start_ptep: Page table pointer for the first entry.
>> + * @ptep: Page table pointer for the first entry.
>>    * @pte: Page table entry for the first page.
>>    * @max_nr: The maximum number of table entries to consider.
>>    * @flags: Flags to modify the PTE batch semantics.
>> @@ -233,24 +233,24 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>    *		  first one is dirty.
>>    *
>>    * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>> - * pages of the same large folio.
>> + * pages of the same large folio in a single VMA and a single page table.
>>    *
>>    * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
>>    * the accessed bit, writable bit, dirty bit (unless FPB_HONOR_DIRTY is set) and
>>    * soft-dirty bit (unless FPB_HONOR_SOFT_DIRTY is set).
>>    *
>> - * start_ptep must map any page of the folio. max_nr must be at least one and
>> - * must be limited by the caller so scanning cannot exceed a single page table.
>> + * @ptep must map any page of the folio. max_nr must be at least one and
>> + * must be limited by the caller so scanning cannot exceed a single VMA and
>> + * a single page table.
>>    *
>>    * Return: the number of table entries in the batch.
>>    */
>> -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>> -		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>> +static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
> 
> Do we need to worry about propagating this type change?
> 
> mremap_folio_pte_batch() and zap_present_ptes() return the value as an int for example.
> 
> I mean I doubt we're going to be seeing an overflow here :) but maybe worth
> propagating this everywhere.

Yeah, I'm planning on cleaning some of that stuff separately. As you 
say, this shouldn't cause harm.

Really, it could only cause harm if someone would be doing

"- folio_pte_batch()" and then assigning it to a "long".

> 
>> +		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
>>   		bool *any_writable, bool *any_young, bool *any_dirty)
>>   {
>> -	pte_t expected_pte, *ptep;
>> -	bool writable, young, dirty;
>> -	int nr, cur_nr;
>> +	unsigned int nr, cur_nr;
>> +	pte_t expected_pte;
>>
>>   	if (any_writable)
>>   		*any_writable = false;
>> @@ -267,29 +267,22 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>   	max_nr = min_t(unsigned long, max_nr,
>>   		       folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
>>
>> -	nr = pte_batch_hint(start_ptep, pte);
>> +	nr = pte_batch_hint(ptep, pte);
>>   	expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
>> -	ptep = start_ptep + nr;
>> +	ptep = ptep + nr;
>>
>>   	while (nr < max_nr) {
>>   		pte = ptep_get(ptep);
>> -		if (any_writable)
>> -			writable = !!pte_write(pte);
>> -		if (any_young)
>> -			young = !!pte_young(pte);
>> -		if (any_dirty)
>> -			dirty = !!pte_dirty(pte);
>> -		pte = __pte_batch_clear_ignored(pte, flags);
>>
>> -		if (!pte_same(pte, expected_pte))
>> +		if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte))
> 
> Doing this here will change the output of any_writable, any_young:
> 
> static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> {
> 	...
> 	return pte_wrprotect(pte_mkold(pte));
> }
> 
> So we probably need to get these values earlier?


Note that __pte_batch_clear_ignored() leaves the pte unchanged, and only 
returns the modified pte. (like pte_mkold() and friends).

So what we read through ptep_get() is still what we have after the 
pte_same() check.
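
A tiny standalone illustration of that value semantic, with a simplified
pte type rather than the kernel's:

#include <stdio.h>

typedef struct { unsigned long val; } pte_t;

#define PTE_YOUNG	0x1UL

/* Takes the pte by value and returns a modified copy, like pte_mkold(). */
static pte_t pte_mkold(pte_t pte)
{
	pte.val &= ~PTE_YOUNG;
	return pte;
}

int main(void)
{
	pte_t pte = { .val = PTE_YOUNG };
	pte_t masked = pte_mkold(pte);

	/* The caller's pte keeps the young bit; only the returned copy lost it. */
	printf("pte=%#lx masked=%#lx\n", pte.val, masked.val);
	return 0;
}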


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements
  2025-06-27 17:02     ` David Hildenbrand
@ 2025-06-27 18:39       ` Lorenzo Stoakes
  0 siblings, 0 replies; 58+ messages in thread
From: Lorenzo Stoakes @ 2025-06-27 18:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 07:02:16PM +0200, David Hildenbrand wrote:
> On 27.06.25 18:51, Lorenzo Stoakes wrote:
> > On Fri, Jun 27, 2025 at 01:55:08PM +0200, David Hildenbrand wrote:
> > > Let's clean up a bit:
> > >
> > > (1) No need for start_ptep vs. ptep anymore, we can simply use ptep
> > >
> > > (2) Let's switch to "unsigned int" for everything
> > >
> > > (3) We can simplify the code by leaving the pte unchanged after the
> > >      pte_same() check.
> > >
> > > (4) Clarify that we should never exceed a single VMA; it indicates a
> > >      problem in the caller.
> > >
> > > No functional change intended.
> > >
> > > Signed-off-by: David Hildenbrand <david@redhat.com>

Given below, LGTM:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> > > -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> > > -		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> > > +static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
> >
> > Do we need to worry about propagating this type change?
> >
> > mremap_folio_pte_batch() and zap_present_ptes() return the value as an int for example.
> >
> > I mean I doubt we're going to be seeing an overflow here :) but maybe worth
> > propagating this everywhere.
>
> Yeah, I'm planning on cleaning some of that stuff separately. As you say,
> this shouldn't cause harm.
>
> Really, it could only cause harm if someone would be doing
>
> "- folio_pte_batch()" and then assigning it to a "long".

Right yeah.

>
> >
> > > +		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
> > >   		bool *any_writable, bool *any_young, bool *any_dirty)
> > >   {
> > > -	pte_t expected_pte, *ptep;
> > > -	bool writable, young, dirty;
> > > -	int nr, cur_nr;
> > > +	unsigned int nr, cur_nr;
> > > +	pte_t expected_pte;
> > >
> > >   	if (any_writable)
> > >   		*any_writable = false;
> > > @@ -267,29 +267,22 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> > >   	max_nr = min_t(unsigned long, max_nr,
> > >   		       folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
> > >
> > > -	nr = pte_batch_hint(start_ptep, pte);
> > > +	nr = pte_batch_hint(ptep, pte);
> > >   	expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
> > > -	ptep = start_ptep + nr;
> > > +	ptep = ptep + nr;
> > >
> > >   	while (nr < max_nr) {
> > >   		pte = ptep_get(ptep);
> > > -		if (any_writable)
> > > -			writable = !!pte_write(pte);
> > > -		if (any_young)
> > > -			young = !!pte_young(pte);
> > > -		if (any_dirty)
> > > -			dirty = !!pte_dirty(pte);
> > > -		pte = __pte_batch_clear_ignored(pte, flags);
> > >
> > > -		if (!pte_same(pte, expected_pte))
> > > +		if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte))
> >
> > Doing this here will change the output of any_writable, any_young:
> >
> > static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> > {
> > 	...
> > 	return pte_wrprotect(pte_mkold(pte));
> > }
> >
> > So we probably need to get these values earlier?
>
>
> Note that __pte_batch_clear_ignored() leaves the pte unchanged, and only
> returns the modified pte. (like pte_mkold() and friends).
>
> So what we read through ptep_get() is still what we have after the
> pte_same() check.

Yeah you're right, sorry, these are just changing the value that is returned for
comparison against expected_pte.

Then LGTM!

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
  2025-06-27 11:55 ` [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext() David Hildenbrand
  2025-06-27 14:19   ` Lance Yang
@ 2025-06-27 18:48   ` Lorenzo Stoakes
  2025-06-30  9:19     ` David Hildenbrand
  2025-06-30 17:45   ` Zi Yan
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 58+ messages in thread
From: Lorenzo Stoakes @ 2025-06-27 18:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 01:55:09PM +0200, David Hildenbrand wrote:
> Many users (including upcoming ones) don't really need the flags etc,
> and can live with a function call.
>
> So let's provide a basic, non-inlined folio_pte_batch().

Hm, but why non-inlined, when it invokes an inlined function? Seems odd no?

>
> In zap_present_ptes(), where we care about performance, the compiler
> already seems to generate a call to a common inlined folio_pte_batch()
> variant, shared with fork() code. So calling the new non-inlined variant
> should not make a difference.
>
> While at it, drop the "addr" parameter that is unused.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Other than the query above + nit on name below, this is really nice!

> ---
>  mm/internal.h  | 11 ++++++++---
>  mm/madvise.c   |  4 ++--
>  mm/memory.c    |  6 ++----
>  mm/mempolicy.c |  3 +--
>  mm/mlock.c     |  3 +--
>  mm/mremap.c    |  3 +--
>  mm/rmap.c      |  3 +--
>  mm/util.c      | 29 +++++++++++++++++++++++++++++
>  8 files changed, 45 insertions(+), 17 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index ca6590c6d9eab..6000b683f68ee 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -218,9 +218,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  }
>
>  /**
> - * folio_pte_batch - detect a PTE batch for a large folio
> + * folio_pte_batch_ext - detect a PTE batch for a large folio
>   * @folio: The large folio to detect a PTE batch for.
> - * @addr: The user virtual address the first page is mapped at.
>   * @ptep: Page table pointer for the first entry.
>   * @pte: Page table entry for the first page.
>   * @max_nr: The maximum number of table entries to consider.
> @@ -243,9 +242,12 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   * must be limited by the caller so scanning cannot exceed a single VMA and
>   * a single page table.
>   *
> + * This function will be inlined to optimize based on the input parameters;
> + * consider using folio_pte_batch() instead if applicable.
> + *
>   * Return: the number of table entries in the batch.
>   */
> -static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
> +static inline unsigned int folio_pte_batch_ext(struct folio *folio,
>  		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
>  		bool *any_writable, bool *any_young, bool *any_dirty)

Sorry this is really really annoying feedback :P but _ext() makes me think of
page_ext and ugh :))

Wonder if __folio_pte_batch() is better?

This is obviously, not a big deal (TM)

>  {
> @@ -293,6 +295,9 @@ static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long ad
>  	return min(nr, max_nr);
>  }
>
> +unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
> +		unsigned int max_nr);
> +
>  /**
>   * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>   *	 forward or backward by delta
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 661bb743d2216..9b9c35a398ed0 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -349,8 +349,8 @@ static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
>  {
>  	int max_nr = (end - addr) / PAGE_SIZE;
>
> -	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
> -			       any_young, any_dirty);
> +	return folio_pte_batch_ext(folio, ptep, pte, max_nr, 0, NULL,
> +				   any_young, any_dirty);
>  }
>
>  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> diff --git a/mm/memory.c b/mm/memory.c
> index ab2d6c1425691..43d35d6675f2e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -995,7 +995,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>  		if (vma_soft_dirty_enabled(src_vma))
>  			flags |= FPB_HONOR_SOFT_DIRTY;
>
> -		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
> +		nr = folio_pte_batch_ext(folio, src_pte, pte, max_nr, flags,
>  				     &any_writable, NULL, NULL);
>  		folio_ref_add(folio, nr);
>  		if (folio_test_anon(folio)) {
> @@ -1564,9 +1564,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
>  	 * by keeping the batching logic separate.
>  	 */
>  	if (unlikely(folio_test_large(folio) && max_nr != 1)) {
> -		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, 0,
> -				     NULL, NULL, NULL);
> -
> +		nr = folio_pte_batch(folio, pte, ptent, max_nr);
>  		zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
>  				       addr, details, rss, force_flush,
>  				       force_break, any_skipped);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 2a25eedc3b1c0..eb83cff7db8c3 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -711,8 +711,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>  		if (!folio || folio_is_zone_device(folio))
>  			continue;
>  		if (folio_test_large(folio) && max_nr != 1)
> -			nr = folio_pte_batch(folio, addr, pte, ptent,
> -					     max_nr, 0, NULL, NULL, NULL);
> +			nr = folio_pte_batch(folio, pte, ptent, max_nr);
>  		/*
>  		 * vm_normal_folio() filters out zero pages, but there might
>  		 * still be reserved folios to skip, perhaps in a VDSO.
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 2238cdc5eb1c1..a1d93ad33c6db 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -313,8 +313,7 @@ static inline unsigned int folio_mlock_step(struct folio *folio,
>  	if (!folio_test_large(folio))
>  		return 1;
>
> -	return folio_pte_batch(folio, addr, pte, ptent, count, 0, NULL,
> -			       NULL, NULL);
> +	return folio_pte_batch(folio, pte, ptent, count);
>  }
>
>  static inline bool allow_mlock_munlock(struct folio *folio,
> diff --git a/mm/mremap.c b/mm/mremap.c
> index d4d3ffc931502..1f5bebbb9c0cb 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -182,8 +182,7 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
>  	if (!folio || !folio_test_large(folio))
>  		return 1;
>
> -	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
> -			       NULL, NULL);
> +	return folio_pte_batch(folio, ptep, pte, max_nr);
>  }
>
>  static int move_ptes(struct pagetable_move_control *pmc,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a29d7d29c7283..6658968600b72 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1859,8 +1859,7 @@ static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
>  	if (pte_pfn(pte) != folio_pfn(folio))
>  		return false;
>
> -	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
> -			       NULL, NULL) == max_nr;
> +	return folio_pte_batch(folio, ptep, pte, max_nr) == max_nr;
>  }
>
>  /*
> diff --git a/mm/util.c b/mm/util.c
> index 0b270c43d7d12..d29dcc135ad28 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1171,3 +1171,32 @@ int compat_vma_mmap_prepare(struct file *file, struct vm_area_struct *vma)
>  	return 0;
>  }
>  EXPORT_SYMBOL(compat_vma_mmap_prepare);
> +
> +#ifdef CONFIG_MMU
> +/**
> + * folio_pte_batch - detect a PTE batch for a large folio
> + * @folio: The large folio to detect a PTE batch for.
> + * @ptep: Page table pointer for the first entry.
> + * @pte: Page table entry for the first page.
> + * @max_nr: The maximum number of table entries to consider.
> + *
> + * This is a simplified variant of folio_pte_batch_ext().
> + *
> + * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> + * pages of the same large folio in a single VMA and a single page table.
> + *
> + * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
> + * the accessed bit, writable bit, dirty bit and soft-dirty bit.
> + *
> + * ptep must map any page of the folio. max_nr must be at least one and
> + * must be limited by the caller so scanning cannot exceed a single VMA and
> + * a single page table.
> + *
> + * Return: the number of table entries in the batch.
> + */
> +unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
> +		unsigned int max_nr)
> +{
> +	return folio_pte_batch_ext(folio, ptep, pte, max_nr, 0, NULL, NULL, NULL);
> +}
> +#endif /* CONFIG_MMU */
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-06-27 11:55 ` [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext() David Hildenbrand
  2025-06-27 14:34   ` Lance Yang
@ 2025-06-27 19:04   ` Lorenzo Stoakes
  2025-06-30  9:32     ` David Hildenbrand
  2025-06-30 17:59   ` Zi Yan
  2 siblings, 1 reply; 58+ messages in thread
From: Lorenzo Stoakes @ 2025-06-27 19:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 01:55:10PM +0200, David Hildenbrand wrote:
> Instead, let's just allow for specifying through flags whether we want
> to have bits merged into the original PTE.
>
> For the madvise() case, simplify by having only a single parameter for
> merging young+dirty. For madvise_cold_or_pageout_pte_range() merging the
> dirty bit is not required, but also not harmful. This code is not that
> performance critical after all to really force all micro-optimizations.
>
> As we now have two pte_t * parameters, use PageTable() to make sure we
> are actually given a pointer at a copy of the PTE, not a pointer into
> an actual page table.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Overall a really nice cleanup! Just some comments below.

> ---
>  mm/internal.h | 58 +++++++++++++++++++++++++++++++--------------------
>  mm/madvise.c  | 26 +++++------------------
>  mm/memory.c   |  8 ++-----
>  mm/util.c     |  2 +-
>  4 files changed, 43 insertions(+), 51 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 6000b683f68ee..fe69e21b34a24 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -208,6 +208,18 @@ typedef int __bitwise fpb_t;
>  /* Compare PTEs honoring the soft-dirty bit. */
>  #define FPB_HONOR_SOFT_DIRTY		((__force fpb_t)BIT(1))
>
> +/*
> + * Merge PTE write bits: if any PTE in the batch is writable, modify the
> + * PTE at @ptentp to be writable.
> + */
> +#define FPB_MERGE_WRITE			((__force fpb_t)BIT(2))
> +
> +/*
> + * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty,
> + * modify the PTE at @ptentp to be young or dirty, respectively.
> + */
> +#define FPB_MERGE_YOUNG_DIRTY		((__force fpb_t)BIT(3))
> +
>  static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  {
>  	if (!(flags & FPB_HONOR_DIRTY))
> @@ -220,16 +232,11 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  /**
>   * folio_pte_batch_ext - detect a PTE batch for a large folio
>   * @folio: The large folio to detect a PTE batch for.
> + * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
>   * @ptep: Page table pointer for the first entry.
> - * @pte: Page table entry for the first page.
> + * @ptentp: Pointer at a copy of the first page table entry.

This seems weird to me, I know it's a pointer to a copy of the PTE, essentially
replacing the pte param from before, but now it's also an output value?
Shouldn't this be made clear?

I know it's a pain and churn but if this is now meant to be an output var we
should probably make it the last param too.

At least needs an (output) or something here.
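
Maybe something like this for the kernel-doc (wording just a suggestion):

 * @ptentp: Pointer to a copy of the first page table entry. Depending on
 *	    @flags, the PTE this points at may be updated (input/output).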

>   * @max_nr: The maximum number of table entries to consider.
>   * @flags: Flags to modify the PTE batch semantics.
> - * @any_writable: Optional pointer to indicate whether any entry except the
> - *		  first one is writable.
> - * @any_young: Optional pointer to indicate whether any entry except the
> - *		  first one is young.
> - * @any_dirty: Optional pointer to indicate whether any entry except the
> - *		  first one is dirty.
>   *
>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>   * pages of the same large folio in a single VMA and a single page table.
> @@ -242,28 +249,26 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   * must be limited by the caller so scanning cannot exceed a single VMA and
>   * a single page table.
>   *
> + * Depending on the FPB_MERGE_* flags, the pte stored at @ptentp will
> + * be modified.

This explains that you modify it but it doesn't really stand out as an output
parameter.

> + *
>   * This function will be inlined to optimize based on the input parameters;
>   * consider using folio_pte_batch() instead if applicable.
>   *
>   * Return: the number of table entries in the batch.
>   */
>  static inline unsigned int folio_pte_batch_ext(struct folio *folio,
> -		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
> -		bool *any_writable, bool *any_young, bool *any_dirty)
> +		struct vm_area_struct *vma, pte_t *ptep, pte_t *ptentp,
> +		unsigned int max_nr, fpb_t flags)
>  {
> +	bool any_writable = false, any_young = false, any_dirty = false;
> +	pte_t expected_pte, pte = *ptentp;
>  	unsigned int nr, cur_nr;
> -	pte_t expected_pte;
> -
> -	if (any_writable)
> -		*any_writable = false;
> -	if (any_young)
> -		*any_young = false;
> -	if (any_dirty)
> -		*any_dirty = false;
>
>  	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>  	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
>  	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
> +	VM_WARN_ON(virt_addr_valid(ptentp) && PageTable(virt_to_page(ptentp)));

Hm so if !virt_addr_valid(ptentp) we're ok? :P I also think a quick comment here
would help, the commit message explains it but glancing at this I'd be confused.

Something like:

/* Ensure this is a pointer to a copy not a pointer into a page table. */

>
>  	/* Limit max_nr to the actual remaining PFNs in the folio we could batch. */
>  	max_nr = min_t(unsigned long, max_nr,
> @@ -279,12 +284,12 @@ static inline unsigned int folio_pte_batch_ext(struct folio *folio,
>  		if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte))
>  			break;
>
> -		if (any_writable)
> -			*any_writable |= pte_write(pte);
> -		if (any_young)
> -			*any_young |= pte_young(pte);
> -		if (any_dirty)
> -			*any_dirty |= pte_dirty(pte);
> +		if (flags & FPB_MERGE_WRITE)
> +			any_writable |= pte_write(pte);
> +		if (flags & FPB_MERGE_YOUNG_DIRTY) {
> +			any_young |= pte_young(pte);
> +			any_dirty |= pte_dirty(pte);
> +		}
>
>  		cur_nr = pte_batch_hint(ptep, pte);
>  		expected_pte = pte_advance_pfn(expected_pte, cur_nr);
> @@ -292,6 +297,13 @@ static inline unsigned int folio_pte_batch_ext(struct folio *folio,
>  		nr += cur_nr;
>  	}
>
> +	if (any_writable)
> +		*ptentp = pte_mkwrite(*ptentp, vma);
> +	if (any_young)
> +		*ptentp = pte_mkyoung(*ptentp);
> +	if (any_dirty)
> +		*ptentp = pte_mkdirty(*ptentp);
> +
>  	return min(nr, max_nr);
>  }
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 9b9c35a398ed0..dce8f5e8555cb 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -344,13 +344,12 @@ static inline bool can_do_file_pageout(struct vm_area_struct *vma)
>
>  static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
>  					  struct folio *folio, pte_t *ptep,
> -					  pte_t pte, bool *any_young,
> -					  bool *any_dirty)
> +					  pte_t *ptentp)
>  {
>  	int max_nr = (end - addr) / PAGE_SIZE;
>
> -	return folio_pte_batch_ext(folio, ptep, pte, max_nr, 0, NULL,
> -				   any_young, any_dirty);
> +	return folio_pte_batch_ext(folio, NULL, ptep, ptentp, max_nr,
> +				   FPB_MERGE_YOUNG_DIRTY);
>  }
>
>  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> @@ -488,13 +487,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		 * next pte in the range.
>  		 */
>  		if (folio_test_large(folio)) {
> -			bool any_young;
> -
> -			nr = madvise_folio_pte_batch(addr, end, folio, pte,
> -						     ptent, &any_young, NULL);
> -			if (any_young)
> -				ptent = pte_mkyoung(ptent);
> -
> +			nr = madvise_folio_pte_batch(addr, end, folio, pte, &ptent);
>  			if (nr < folio_nr_pages(folio)) {
>  				int err;
>
> @@ -724,11 +717,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		 * next pte in the range.
>  		 */
>  		if (folio_test_large(folio)) {
> -			bool any_young, any_dirty;
> -
> -			nr = madvise_folio_pte_batch(addr, end, folio, pte,
> -						     ptent, &any_young, &any_dirty);
> -
> +			nr = madvise_folio_pte_batch(addr, end, folio, pte, &ptent);
>  			if (nr < folio_nr_pages(folio)) {
>  				int err;
>
> @@ -753,11 +742,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  					nr = 0;
>  				continue;
>  			}
> -
> -			if (any_young)
> -				ptent = pte_mkyoung(ptent);
> -			if (any_dirty)
> -				ptent = pte_mkdirty(ptent);
>  		}
>
>  		if (folio_test_swapcache(folio) || folio_test_dirty(folio)) {
> diff --git a/mm/memory.c b/mm/memory.c
> index 43d35d6675f2e..985d09bee44fd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -972,10 +972,9 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>  		 pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
>  		 int max_nr, int *rss, struct folio **prealloc)
>  {
> +	fpb_t flags = FPB_MERGE_WRITE;
>  	struct page *page;
>  	struct folio *folio;
> -	bool any_writable;
> -	fpb_t flags = 0;
>  	int err, nr;
>
>  	page = vm_normal_page(src_vma, addr, pte);
> @@ -995,8 +994,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>  		if (vma_soft_dirty_enabled(src_vma))
>  			flags |= FPB_HONOR_SOFT_DIRTY;
>
> -		nr = folio_pte_batch_ext(folio, src_pte, pte, max_nr, flags,
> -				     &any_writable, NULL, NULL);
> +		nr = folio_pte_batch_ext(folio, src_vma, src_pte, &pte, max_nr, flags);
>  		folio_ref_add(folio, nr);
>  		if (folio_test_anon(folio)) {
>  			if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
> @@ -1010,8 +1008,6 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>  			folio_dup_file_rmap_ptes(folio, page, nr, dst_vma);
>  			rss[mm_counter_file(folio)] += nr;
>  		}
> -		if (any_writable)
> -			pte = pte_mkwrite(pte, src_vma);
>  		__copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte,
>  				    addr, nr);
>  		return nr;
> diff --git a/mm/util.c b/mm/util.c
> index d29dcc135ad28..19d1a5814fac7 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1197,6 +1197,6 @@ EXPORT_SYMBOL(compat_vma_mmap_prepare);
>  unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
>  		unsigned int max_nr)
>  {
> -	return folio_pte_batch_ext(folio, ptep, pte, max_nr, 0, NULL, NULL, NULL);
> +	return folio_pte_batch_ext(folio, NULL, ptep, &pte, max_nr, 0);
>  }
>  #endif /* CONFIG_MMU */
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-27 11:55 ` [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_* David Hildenbrand
  2025-06-27 13:40   ` Lance Yang
  2025-06-27 16:28   ` Lorenzo Stoakes
@ 2025-06-28  3:37   ` Dev Jain
  2025-06-28 21:00     ` David Hildenbrand
  2025-06-30 14:35   ` Zi Yan
  2025-07-02  8:31   ` Oscar Salvador
  4 siblings, 1 reply; 58+ messages in thread
From: Dev Jain @ 2025-06-28  3:37 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo, Ryan Roberts


On 27/06/25 5:25 pm, David Hildenbrand wrote:
> Honoring these PTE bits is the exception, so let's invert the meaning.
>
> With this change, most callers don't have to pass any flags.
>
> No functional change intended.

FWIW I had proposed this kind of change earlier to Ryan (CCed) and
he pointed out: "Doesn't that argument apply in reverse if you want
to ignore something new in future?

By default we are comparing all the bits in the pte when determining the batch.
The flags request to ignore certain bits. If we want to ignore extra bits in
future, we add new flags and the existing callers don't need to be updated.

With your approach, every time you want the ability to ignore extra bits, you
need to update all the callers, which is error prone."

>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   mm/internal.h  | 16 ++++++++--------
>   mm/madvise.c   |  3 +--
>   mm/memory.c    | 11 +++++------
>   mm/mempolicy.c |  4 +---
>   mm/mlock.c     |  3 +--
>   mm/mremap.c    |  3 +--
>   mm/rmap.c      |  3 +--
>   7 files changed, 18 insertions(+), 25 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index e84217e27778d..9690c75063881 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -202,17 +202,17 @@ static inline void vma_close(struct vm_area_struct *vma)
>   /* Flags for folio_pte_batch(). */
>   typedef int __bitwise fpb_t;
>   
> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> -#define FPB_IGNORE_DIRTY		((__force fpb_t)BIT(0))
> +/* Compare PTEs honoring the dirty bit. */
> +#define FPB_HONOR_DIRTY		((__force fpb_t)BIT(0))
>   
> -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> -#define FPB_IGNORE_SOFT_DIRTY		((__force fpb_t)BIT(1))
> +/* Compare PTEs honoring the soft-dirty bit. */
> +#define FPB_HONOR_SOFT_DIRTY		((__force fpb_t)BIT(1))
>   
>   static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   {
> -	if (flags & FPB_IGNORE_DIRTY)
> +	if (!(flags & FPB_HONOR_DIRTY))
>   		pte = pte_mkclean(pte);
> -	if (likely(flags & FPB_IGNORE_SOFT_DIRTY))
> +	if (likely(!(flags & FPB_HONOR_SOFT_DIRTY)))
>   		pte = pte_clear_soft_dirty(pte);
>   	return pte_wrprotect(pte_mkold(pte));
>   }
> @@ -236,8 +236,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>    * pages of the same large folio.
>    *
>    * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
> - * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
> - * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
> + * the accessed bit, writable bit, dirty bit (unless FPB_HONOR_DIRTY is set) and
> + * soft-dirty bit (unless FPB_HONOR_SOFT_DIRTY is set).
>    *
>    * start_ptep must map any page of the folio. max_nr must be at least one and
>    * must be limited by the caller so scanning cannot exceed a single page table.
> diff --git a/mm/madvise.c b/mm/madvise.c
> index e61e32b2cd91f..661bb743d2216 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -347,10 +347,9 @@ static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
>   					  pte_t pte, bool *any_young,
>   					  bool *any_dirty)
>   {
> -	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>   	int max_nr = (end - addr) / PAGE_SIZE;
>   
> -	return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
>   			       any_young, any_dirty);
>   }
>   
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f9b32a20e5b7..ab2d6c1425691 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -990,10 +990,10 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>   	 * by keeping the batching logic separate.
>   	 */
>   	if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
> -		if (src_vma->vm_flags & VM_SHARED)
> -			flags |= FPB_IGNORE_DIRTY;
> -		if (!vma_soft_dirty_enabled(src_vma))
> -			flags |= FPB_IGNORE_SOFT_DIRTY;
> +		if (!(src_vma->vm_flags & VM_SHARED))
> +			flags |= FPB_HONOR_DIRTY;
> +		if (vma_soft_dirty_enabled(src_vma))
> +			flags |= FPB_HONOR_SOFT_DIRTY;
>   
>   		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
>   				     &any_writable, NULL, NULL);
> @@ -1535,7 +1535,6 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
>   		struct zap_details *details, int *rss, bool *force_flush,
>   		bool *force_break, bool *any_skipped)
>   {
> -	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>   	struct mm_struct *mm = tlb->mm;
>   	struct folio *folio;
>   	struct page *page;
> @@ -1565,7 +1564,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
>   	 * by keeping the batching logic separate.
>   	 */
>   	if (unlikely(folio_test_large(folio) && max_nr != 1)) {
> -		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
> +		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, 0,
>   				     NULL, NULL, NULL);
>   
>   		zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 1ff7b2174eb77..2a25eedc3b1c0 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -675,7 +675,6 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
>   static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>   			unsigned long end, struct mm_walk *walk)
>   {
> -	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>   	struct vm_area_struct *vma = walk->vma;
>   	struct folio *folio;
>   	struct queue_pages *qp = walk->private;
> @@ -713,8 +712,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
>   			continue;
>   		if (folio_test_large(folio) && max_nr != 1)
>   			nr = folio_pte_batch(folio, addr, pte, ptent,
> -					     max_nr, fpb_flags,
> -					     NULL, NULL, NULL);
> +					     max_nr, 0, NULL, NULL, NULL);
>   		/*
>   		 * vm_normal_folio() filters out zero pages, but there might
>   		 * still be reserved folios to skip, perhaps in a VDSO.
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 3cb72b579ffd3..2238cdc5eb1c1 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -307,14 +307,13 @@ void munlock_folio(struct folio *folio)
>   static inline unsigned int folio_mlock_step(struct folio *folio,
>   		pte_t *pte, unsigned long addr, unsigned long end)
>   {
> -	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>   	unsigned int count = (end - addr) >> PAGE_SHIFT;
>   	pte_t ptent = ptep_get(pte);
>   
>   	if (!folio_test_large(folio))
>   		return 1;
>   
> -	return folio_pte_batch(folio, addr, pte, ptent, count, fpb_flags, NULL,
> +	return folio_pte_batch(folio, addr, pte, ptent, count, 0, NULL,
>   			       NULL, NULL);
>   }
>   
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 36585041c760d..d4d3ffc931502 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -173,7 +173,6 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>   static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>   		pte_t *ptep, pte_t pte, int max_nr)
>   {
> -	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>   	struct folio *folio;
>   
>   	if (max_nr == 1)
> @@ -183,7 +182,7 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
>   	if (!folio || !folio_test_large(folio))
>   		return 1;
>   
> -	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
>   			       NULL, NULL);
>   }
>   
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 3b74bb19c11dd..a29d7d29c7283 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1849,7 +1849,6 @@ void folio_remove_rmap_pud(struct folio *folio, struct page *page,
>   static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
>   			struct folio *folio, pte_t *ptep)
>   {
> -	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>   	int max_nr = folio_nr_pages(folio);
>   	pte_t pte = ptep_get(ptep);
>   
> @@ -1860,7 +1859,7 @@ static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
>   	if (pte_pfn(pte) != folio_pfn(folio))
>   		return false;
>   
> -	return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
>   			       NULL, NULL) == max_nr;
>   }
>   

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-28  3:37   ` Dev Jain
@ 2025-06-28 21:00     ` David Hildenbrand
  2025-06-30  3:34       ` Dev Jain
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-06-28 21:00 UTC (permalink / raw)
  To: Dev Jain, linux-kernel
  Cc: linux-mm, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo, Ryan Roberts

On 28.06.25 05:37, Dev Jain wrote:
> 
> On 27/06/25 5:25 pm, David Hildenbrand wrote:
>> Honoring these PTE bits is the exception, so let's invert the meaning.
>>
>> With this change, most callers don't have to pass any flags.
>>
>> No functional change intended.
> 
> FWIW I had proposed this kind of change earlier to Ryan (CCed) and
> he pointed out: "Doesn't that argument apply in reverse if you want
> to ignore something new in future?
> 
> By default we are comparing all the bits in the pte when determining the batch.
> The flags request to ignore certain bits.

That statement is not true: by default we ignore the write and young 
bits. And we don't have flags for that ;)

Now we also ignore the dirty and soft-dirty bits by default, unless told 
not to do that by one very specific caller.

> If we want to ignore extra bits in
> future, we add new flags and the existing callers don't need to be updated.

What stops you from using FPB_IGNORE_* for something else in the future?

As a side note, there are not that many relevant PTE bits to worry about 
in the near future ;)

I mean, uffd-wp, sure, ... and before we add a FPB_HONOR_UFFD_WP to all 
users to be safe (and changing the default to ignore), you could add a 
FPB_IGNORE_UFFD_WP first, to then check who really can tolerate just 
ignoring it (most of them, I assume).
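
As a rough sketch of that first step (flag name invented here, and assuming
pte_clear_uffd_wp() is usable in this context), it would just be another
opt-in bit in __pte_batch_clear_ignored():

	if (flags & FPB_IGNORE_UFFD_WP)
		pte = pte_clear_uffd_wp(pte);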

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-27 16:33       ` Lorenzo Stoakes
@ 2025-06-29  8:59         ` Mike Rapoport
  2025-06-30 13:47           ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Mike Rapoport @ 2025-06-29  8:59 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, linux-kernel, linux-mm, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 05:33:06PM +0100, Lorenzo Stoakes wrote:
> On Fri, Jun 27, 2025 at 06:30:13PM +0200, David Hildenbrand wrote:
> > On 27.06.25 18:28, Lorenzo Stoakes wrote:
> > > On Fri, Jun 27, 2025 at 01:55:07PM +0200, David Hildenbrand wrote:
> > > > Honoring these PTE bits is the exception, so let's invert the meaning.
> > > >
> > > > With this change, most callers don't have to pass any flags.
> > > >
> > > > No functional change intended.
> > > >
> > > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > >
> > > This is a nice change, it removes a lot of code I really didn't enjoy
> > > looking at for introducing these flags all over the place.
> > >
> > > But a nit on the naming below, I'm not a fan of 'honor' here :)
> > >
> > > > ---
> > > >   mm/internal.h  | 16 ++++++++--------
> > > >   mm/madvise.c   |  3 +--
> > > >   mm/memory.c    | 11 +++++------
> > > >   mm/mempolicy.c |  4 +---
> > > >   mm/mlock.c     |  3 +--
> > > >   mm/mremap.c    |  3 +--
> > > >   mm/rmap.c      |  3 +--
> > > >   7 files changed, 18 insertions(+), 25 deletions(-)
> > > >
> > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > index e84217e27778d..9690c75063881 100644
> > > > --- a/mm/internal.h
> > > > +++ b/mm/internal.h
> > > > @@ -202,17 +202,17 @@ static inline void vma_close(struct vm_area_struct *vma)
> > > >   /* Flags for folio_pte_batch(). */
> > > >   typedef int __bitwise fpb_t;
> > > >
> > > > -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> > > > -#define FPB_IGNORE_DIRTY		((__force fpb_t)BIT(0))
> > > > +/* Compare PTEs honoring the dirty bit. */
> > > > +#define FPB_HONOR_DIRTY		((__force fpb_t)BIT(0))
> > >
> > > Hm not to be petty but... :)
> > >
> > > I'm not sure I find 'honor' very clear here. Ignore is very clear, 'honor' (God
> > > the British English in me wants to say honour here but stipp :P) doesn't
> > > necessarily tell you what is going to happen.
> > >
> > > Perhaps PROPAGATE? or OBEY?
> >
> > RESPECT? :)

DONT_IGNORE ;-)

> 🎵 R-E-S-P-E-C-T find out what it means to me... ;) 🎵
> 
> This works too :>)

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-28 21:00     ` David Hildenbrand
@ 2025-06-30  3:34       ` Dev Jain
  2025-06-30  9:04         ` Ryan Roberts
  0 siblings, 1 reply; 58+ messages in thread
From: Dev Jain @ 2025-06-30  3:34 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo, Ryan Roberts


On 29/06/25 2:30 am, David Hildenbrand wrote:
> On 28.06.25 05:37, Dev Jain wrote:
>>
>> On 27/06/25 5:25 pm, David Hildenbrand wrote:
>>> Honoring these PTE bits is the exception, so let's invert the meaning.
>>>
>>> With this change, most callers don't have to pass any flags.
>>>
>>> No functional change intended.
>>
>> FWIW I had proposed this kind of change earlier to Ryan (CCed) and
>> he pointed out: "Doesn't that argument apply in reverse if you want
>> to ignore something new in future?
>>
>> By default we are comparing all the bits in the pte when determining 
>> the batch.
>> The flags request to ignore certain bits.
>
> That statement is not true: as default we ignore the write and young 
> bit. And we don't have flags for that ;)
>
> Now we also ignore the dirty and soft-dity bit as default, unless told 
> not to do that by one very specific caller.
>
>> If we want to ignore extra bits in
>> future, we add new flags and the existing callers don't need to be 
>> updated.
>
> What stops you from using FPB_IGNORE_* for something else in the future?
>
> As a side note, there are not that many relevant PTE bits to worry 
> about in the near future ;)
>
> I mean, uffd-wp, sure, ... and before we add a FPB_HONOR_UFFD_WP to 
> all users to be safe (and changing the default to ignore), you could 
> add a FPB_IGNORE_UFFD_WP first, to then check who really can tolerate 
> just ignoring it (most of them, I assume).
I agree.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-30  3:34       ` Dev Jain
@ 2025-06-30  9:04         ` Ryan Roberts
  2025-06-30  9:08           ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Ryan Roberts @ 2025-06-30  9:04 UTC (permalink / raw)
  To: Dev Jain, David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On 30/06/2025 04:34, Dev Jain wrote:
> 
> On 29/06/25 2:30 am, David Hildenbrand wrote:
>> On 28.06.25 05:37, Dev Jain wrote:
>>>
>>> On 27/06/25 5:25 pm, David Hildenbrand wrote:
>>>> Honoring these PTE bits is the exception, so let's invert the meaning.
>>>>
>>>> With this change, most callers don't have to pass any flags.
>>>>
>>>> No functional change intended.
>>>
>>> FWIW I had proposed this kind of change earlier to Ryan (CCed) and
>>> he pointed out: "Doesn't that argument apply in reverse if you want
>>> to ignore something new in future?
>>>
>>> By default we are comparing all the bits in the pte when determining the batch.
>>> The flags request to ignore certain bits.
>>
>> That statement is not true: as default we ignore the write and young bit. And
>> we don't have flags for that ;)
>>
>> Now we also ignore the dirty and soft-dity bit as default, unless told not to
>> do that by one very specific caller.
>>
>>> If we want to ignore extra bits in
>>> future, we add new flags and the existing callers don't need to be updated.
>>
>> What stops you from using FPB_IGNORE_* for something else in the future?
>>
>> As a side note, there are not that many relevant PTE bits to worry about in
>> the near future ;)
>>
>> I mean, uffd-wp, sure, ... and before we add a FPB_HONOR_UFFD_WP to all users
>> to be safe (and changing the default to ignore), you could add a
>> FPB_IGNORE_UFFD_WP first, to then check who really can tolerate just ignoring
>> it (most of them, I assume).
> I agree.

Meh. Personally I think if you start mixing HONOR and IGNORE flags, it becomes
very confusing to work out what is being checked for and what is not. I stand by
my original view. But yeah, writable and young confuse it a bit... How about
generalizing by explicitly requiring IGNORE flags for write and young, then also
create a flags macro for the common case?

#define FPB_IGNORE_COMMON (FPB_IGNORE_WRITE | FPB_IGNORE_YOUNG |	\
			   FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY)
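
Just to sketch how a typical call site would compare under the two
conventions (illustrative only, using the pre-series signature):

	/* explicit-ignore convention: */
	nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
			     FPB_IGNORE_COMMON, NULL, NULL, NULL);

	/* honor-by-exception convention (this series): */
	nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, 0,
			     NULL, NULL, NULL);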

It's not a hill I'm going to die on though...

Thanks,
Ryan

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-30  9:04         ` Ryan Roberts
@ 2025-06-30  9:08           ` David Hildenbrand
  2025-06-30  9:18             ` Ryan Roberts
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-06-30  9:08 UTC (permalink / raw)
  To: Ryan Roberts, Dev Jain, linux-kernel
  Cc: linux-mm, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On 30.06.25 11:04, Ryan Roberts wrote:
> On 30/06/2025 04:34, Dev Jain wrote:
>>
>> On 29/06/25 2:30 am, David Hildenbrand wrote:
>>> On 28.06.25 05:37, Dev Jain wrote:
>>>>
>>>> On 27/06/25 5:25 pm, David Hildenbrand wrote:
>>>>> Honoring these PTE bits is the exception, so let's invert the meaning.
>>>>>
>>>>> With this change, most callers don't have to pass any flags.
>>>>>
>>>>> No functional change intended.
>>>>
>>>> FWIW I had proposed this kind of change earlier to Ryan (CCed) and
>>>> he pointed out: "Doesn't that argument apply in reverse if you want
>>>> to ignore something new in future?
>>>>
>>>> By default we are comparing all the bits in the pte when determining the batch.
>>>> The flags request to ignore certain bits.
>>>
>>> That statement is not true: as default we ignore the write and young bit. And
>>> we don't have flags for that ;)
>>>
>>> Now we also ignore the dirty and soft-dity bit as default, unless told not to
>>> do that by one very specific caller.
>>>
>>>> If we want to ignore extra bits in
>>>> future, we add new flags and the existing callers don't need to be updated.
>>>
>>> What stops you from using FPB_IGNORE_* for something else in the future?
>>>
>>> As a side note, there are not that many relevant PTE bits to worry about in
>>> the near future ;)
>>>
>>> I mean, uffd-wp, sure, ... and before we add a FPB_HONOR_UFFD_WP to all users
>>> to be safe (and changing the default to ignore), you could add a
>>> FPB_IGNORE_UFFD_WP first, to then check who really can tolerate just ignoring
>>> it (most of them, I assume).
>> I agree.
> 
> Meh. Personally I think if you start mixing HONOR and IGNORE flags, it becomes
> very confusing to work out what is being checked for and what is not. I stand by
> my original view. But yeah, writable and young confuse it a bit... How about
> generalizing by explicitly requiring IGNORE flags for write and young, then also
> create a flags macro for the common case?
> 
> #define FPB_IGNORE_COMMON (FPB_IGNORE_WRITE | FPB_IGNORE_YOUNG |	\
> 			   FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY)
> 
> It's not a hill I'm going to die on though...

How about we make this function simpler, not more complicated? ;)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-30  9:08           ` David Hildenbrand
@ 2025-06-30  9:18             ` Ryan Roberts
  2025-06-30  9:24               ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Ryan Roberts @ 2025-06-30  9:18 UTC (permalink / raw)
  To: David Hildenbrand, Dev Jain, linux-kernel
  Cc: linux-mm, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On 30/06/2025 10:08, David Hildenbrand wrote:
> On 30.06.25 11:04, Ryan Roberts wrote:
>> On 30/06/2025 04:34, Dev Jain wrote:
>>>
>>> On 29/06/25 2:30 am, David Hildenbrand wrote:
>>>> On 28.06.25 05:37, Dev Jain wrote:
>>>>>
>>>>> On 27/06/25 5:25 pm, David Hildenbrand wrote:
>>>>>> Honoring these PTE bits is the exception, so let's invert the meaning.
>>>>>>
>>>>>> With this change, most callers don't have to pass any flags.
>>>>>>
>>>>>> No functional change intended.
>>>>>
>>>>> FWIW I had proposed this kind of change earlier to Ryan (CCed) and
>>>>> he pointed out: "Doesn't that argument apply in reverse if you want
>>>>> to ignore something new in future?
>>>>>
>>>>> By default we are comparing all the bits in the pte when determining the
>>>>> batch.
>>>>> The flags request to ignore certain bits.
>>>>
>>>> That statement is not true: as default we ignore the write and young bit. And
>>>> we don't have flags for that ;)
>>>>
>>>> Now we also ignore the dirty and soft-dity bit as default, unless told not to
>>>> do that by one very specific caller.
>>>>
>>>>> If we want to ignore extra bits in
>>>>> future, we add new flags and the existing callers don't need to be updated.
>>>>
>>>> What stops you from using FPB_IGNORE_* for something else in the future?
>>>>
>>>> As a side note, there are not that many relevant PTE bits to worry about in
>>>> the near future ;)
>>>>
>>>> I mean, uffd-wp, sure, ... and before we add a FPB_HONOR_UFFD_WP to all users
>>>> to be safe (and changing the default to ignore), you could add a
>>>> FPB_IGNORE_UFFD_WP first, to then check who really can tolerate just ignoring
>>>> it (most of them, I assume).
>>> I agree.
>>
>> Meh. Personally I think if you start mixing HONOR and IGNORE flags, it becomes
>> very confusing to work out what is being checked for and what is not. I stand by
>> my original view. But yeah, writable and young confuse it a bit... How about
>> generalizing by explicitly requiring IGNORE flags for write and young, then also
>> create a flags macro for the common case?
>>
>> #define FPB_IGNORE_COMMON (FPB_IGNORE_WRITE | FPB_IGNORE_YOUNG |    \
>>                FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY)
>>
>> It's not a hill I'm going to die on though...
> 
> How about we make this function simpler, not more complicated? ;)

I think here we both have different views of what is simpler... You are trying
to optimize for the callers writing less code. I'm trying to optimize for the
reader to be able to easily determine what the function will do for a given caller.

But it's your name above the door :) I don't want to bikeshed, so go ahead
change to HONOR.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
  2025-06-27 18:48   ` Lorenzo Stoakes
@ 2025-06-30  9:19     ` David Hildenbrand
  2025-06-30 10:41       ` Lorenzo Stoakes
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-06-30  9:19 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On 27.06.25 20:48, Lorenzo Stoakes wrote:
> On Fri, Jun 27, 2025 at 01:55:09PM +0200, David Hildenbrand wrote:
>> Many users (including upcoming ones) don't really need the flags etc,
>> and can live with a function call.
>>
>> So let's provide a basic, non-inlined folio_pte_batch().
> 
> Hm, but why non-inlined, when it invokes an inlined function? Seems odd no?

We want to always generate a function that uses as few runtime checks 
as possible. Essentially, optimize out the "flags" as much as possible.

In case of folio_pte_batch(), where we won't use any flags, any checks 
will be optimized out by the compiler.

So we get a single, specialized, non-inlined function.
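
As a toy userspace illustration of the pattern (obviously not the real
mm/ code, just showing how the constant flags let the compiler specialize
the out-of-line wrapper):

#include <stdio.h>

#define FLAG_EXTRA_CHECK	(1u << 0)

/* stand-in for folio_pte_batch_ext(): flag checks visible to the compiler */
static inline unsigned int batch_ext(unsigned int nr, unsigned int flags)
{
	if (flags & FLAG_EXTRA_CHECK)	/* dead code when flags == 0 */
		nr -= 1;
	return nr;
}

/* stand-in for the non-inlined folio_pte_batch(): flags is constant 0,
 * so we get one specialized function without the runtime check */
unsigned int batch(unsigned int nr)
{
	return batch_ext(nr, 0);
}

int main(void)
{
	printf("%u\n", batch(8));	/* prints 8 */
	return 0;
}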

> 
>>
>> In zap_present_ptes(), where we care about performance, the compiler
>> already seems to generate a call to a common inlined folio_pte_batch()
>> variant, shared with fork() code. So calling the new non-inlined variant
>> should not make a difference.
>>
>> While at it, drop the "addr" parameter that is unused.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Other than the query above + nit on name below, this is really nice!
> 
>> ---
>>   mm/internal.h  | 11 ++++++++---
>>   mm/madvise.c   |  4 ++--
>>   mm/memory.c    |  6 ++----
>>   mm/mempolicy.c |  3 +--
>>   mm/mlock.c     |  3 +--
>>   mm/mremap.c    |  3 +--
>>   mm/rmap.c      |  3 +--
>>   mm/util.c      | 29 +++++++++++++++++++++++++++++
>>   8 files changed, 45 insertions(+), 17 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index ca6590c6d9eab..6000b683f68ee 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -218,9 +218,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>   }
>>
>>   /**
>> - * folio_pte_batch - detect a PTE batch for a large folio
>> + * folio_pte_batch_ext - detect a PTE batch for a large folio
>>    * @folio: The large folio to detect a PTE batch for.
>> - * @addr: The user virtual address the first page is mapped at.
>>    * @ptep: Page table pointer for the first entry.
>>    * @pte: Page table entry for the first page.
>>    * @max_nr: The maximum number of table entries to consider.
>> @@ -243,9 +242,12 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>    * must be limited by the caller so scanning cannot exceed a single VMA and
>>    * a single page table.
>>    *
>> + * This function will be inlined to optimize based on the input parameters;
>> + * consider using folio_pte_batch() instead if applicable.
>> + *
>>    * Return: the number of table entries in the batch.
>>    */
>> -static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
>> +static inline unsigned int folio_pte_batch_ext(struct folio *folio,
>>   		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
>>   		bool *any_writable, bool *any_young, bool *any_dirty)
> 
> Sorry this is really really annoying feedback :P but _ext() makes me think of
> page_ext and ugh :))
> 
> Wonder if __folio_pte_batch() is better?
> 
> This is obviously, not a big deal (TM)

Obviously, I had that as part of the development, and decided against it 
at some point. :)

Yeah, _ext() is not common in MM yet, in contrast to other subsystems. 
The only user is indeed page_ext. On arm we seem to have set_pte_ext(). 
But it's really "page_ext", that's the problematic part, not "_ext" :P

No strong opinion, but I tend to dislike here "__", because often it 
means "internal helper you're not supposed to used", which isn't really 
the case here.

E.g.,

alloc_frozen_pages() -> alloc_frozen_pages_noprof() -> 
__alloc_frozen_pages_noprof()

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-30  9:18             ` Ryan Roberts
@ 2025-06-30  9:24               ` David Hildenbrand
  2025-06-30 10:57                 ` Ryan Roberts
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-06-30  9:24 UTC (permalink / raw)
  To: Ryan Roberts, Dev Jain, linux-kernel
  Cc: linux-mm, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On 30.06.25 11:18, Ryan Roberts wrote:
> On 30/06/2025 10:08, David Hildenbrand wrote:
>> On 30.06.25 11:04, Ryan Roberts wrote:
>>> On 30/06/2025 04:34, Dev Jain wrote:
>>>>
>>>> On 29/06/25 2:30 am, David Hildenbrand wrote:
>>>>> On 28.06.25 05:37, Dev Jain wrote:
>>>>>>
>>>>>> On 27/06/25 5:25 pm, David Hildenbrand wrote:
>>>>>>> Honoring these PTE bits is the exception, so let's invert the meaning.
>>>>>>>
>>>>>>> With this change, most callers don't have to pass any flags.
>>>>>>>
>>>>>>> No functional change intended.
>>>>>>
>>>>>> FWIW I had proposed this kind of change earlier to Ryan (CCed) and
>>>>>> he pointed out: "Doesn't that argument apply in reverse if you want
>>>>>> to ignore something new in future?
>>>>>>
>>>>>> By default we are comparing all the bits in the pte when determining the
>>>>>> batch.
>>>>>> The flags request to ignore certain bits.
>>>>>
>>>>> That statement is not true: as default we ignore the write and young bit. And
>>>>> we don't have flags for that ;)
>>>>>
>>>>> Now we also ignore the dirty and soft-dity bit as default, unless told not to
>>>>> do that by one very specific caller.
>>>>>
>>>>>> If we want to ignore extra bits in
>>>>>> future, we add new flags and the existing callers don't need to be updated.
>>>>>
>>>>> What stops you from using FPB_IGNORE_* for something else in the future?
>>>>>
>>>>> As a side note, there are not that many relevant PTE bits to worry about in
>>>>> the near future ;)
>>>>>
>>>>> I mean, uffd-wp, sure, ... and before we add a FPB_HONOR_UFFD_WP to all users
>>>>> to be safe (and changing the default to ignore), you could add a
>>>>> FPB_IGNORE_UFFD_WP first, to then check who really can tolerate just ignoring
>>>>> it (most of them, I assume).
>>>> I agree.
>>>
>>> Meh. Personally I think if you start mixing HONOR and IGNORE flags, it becomes
>>> very confusing to work out what is being checked for and what is not. I stand by
>>> my original view. But yeah, writable and young confuse it a bit... How about
>>> generalizing by explicitly requiring IGNORE flags for write and young, then also
>>> create a flags macro for the common case?
>>>
>>> #define FPB_IGNORE_COMMON (FPB_IGNORE_WRITE | FPB_IGNORE_YOUNG |    \
>>>                 FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY)
>>>
>>> It's not a hill I'm going to die on though...
>>
>> How about we make this function simpler, not more complicated? ;)
> 
> I think here we both have different views of what is simpler... You are trying
> to optimize for the callers writing less code. I'm trying to optimize for the
> reader to be able to easily determine what the function will do for a given caller.

See patch number #3: I want the default function -- folio_pte_batch() -- 
to not have any flags at all.

And I don't want to achieve that by internally using flags when calling 
folio_pte_batch_ext().

If you don't specify flags (folio_pte_batch()), behave just as if 
calling folio_pte_batch_ext() without flags. Anything else would be more 
confusing IMHO.

I agree that mixing HONOR and IGNORE is not a good idea. But then, it's 
really only uffd-wp that still could be batched, and likely we want it 
to be the default, and respect/honor/whatever instead in the cases where 
we really have to.

(If we really want to go down that path and batch it :) )

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-06-27 19:04   ` Lorenzo Stoakes
@ 2025-06-30  9:32     ` David Hildenbrand
  2025-06-30 11:08       ` Lorenzo Stoakes
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-06-30  9:32 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On 27.06.25 21:04, Lorenzo Stoakes wrote:
> On Fri, Jun 27, 2025 at 01:55:10PM +0200, David Hildenbrand wrote:
>> Instead, let's just allow for specifying through flags whether we want
>> to have bits merged into the original PTE.
>>
>> For the madvise() case, simplify by having only a single parameter for
>> merging young+dirty. For madvise_cold_or_pageout_pte_range() merging the
>> dirty bit is not required, but also not harmful. This code is not that
>> performance critical after all to really force all micro-optimizations.
>>
>> As we now have two pte_t * parameters, use PageTable() to make sure we
>> are actually given a pointer at a copy of the PTE, not a pointer into
>> an actual page table.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Overall a really nice cleanup! Just some comments below.
> 
>> ---
>>   mm/internal.h | 58 +++++++++++++++++++++++++++++++--------------------
>>   mm/madvise.c  | 26 +++++------------------
>>   mm/memory.c   |  8 ++-----
>>   mm/util.c     |  2 +-
>>   4 files changed, 43 insertions(+), 51 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 6000b683f68ee..fe69e21b34a24 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -208,6 +208,18 @@ typedef int __bitwise fpb_t;
>>   /* Compare PTEs honoring the soft-dirty bit. */
>>   #define FPB_HONOR_SOFT_DIRTY		((__force fpb_t)BIT(1))
>>
>> +/*
>> + * Merge PTE write bits: if any PTE in the batch is writable, modify the
>> + * PTE at @ptentp to be writable.
>> + */
>> +#define FPB_MERGE_WRITE			((__force fpb_t)BIT(2))
>> +
>> +/*
>> + * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty,
>> + * modify the PTE at @ptentp to be young or dirty, respectively.
>> + */
>> +#define FPB_MERGE_YOUNG_DIRTY		((__force fpb_t)BIT(3))
>> +
>>   static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>   {
>>   	if (!(flags & FPB_HONOR_DIRTY))
>> @@ -220,16 +232,11 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>   /**
>>    * folio_pte_batch_ext - detect a PTE batch for a large folio
>>    * @folio: The large folio to detect a PTE batch for.
>> + * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
>>    * @ptep: Page table pointer for the first entry.
>> - * @pte: Page table entry for the first page.
>> + * @ptentp: Pointer at a copy of the first page table entry.
> 
> This seems weird to me, I know it's a pointer to a copy of the PTE, essentially
> replacing the pte param from before, but now it's also an output value?
> Shouldn't this be made clear?

As you spotted, I make that clear below and for each and every flag that 
someone would set that would affect it.

> 
> I know it's a pain and churn but if this is now meant to be an output var we
> should probably make it the last param too.
> 
> At least needs an (output) or something here.

Well, it's an input+output parameter.

"Pointer at a copy of the first page table entry that might be modified 
depending on @flags." is a bit mouthful, but maybe clearer than just 
"output".

[...]

>>   	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>   	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
>>   	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
>> +	VM_WARN_ON(virt_addr_valid(ptentp) && PageTable(virt_to_page(ptentp)));
> 
> Hm so if !virt_addr_valid(ptentp) we're ok? :P

I had the same question when writing that. Obviously, 
PageTable(virt_to_page(ptentp)) faulted when called on something on the 
stack. (ran into that ... :) )

Maybe "VM_WARN_ON(virt_addr_valid(ptentp));" would work as well, but I 
am not sure how that function behaves on all architectures ...

> I also think a quick comment here
> would help, the commit message explains it but glancing at this I'd be confused.
> 
> Something like:
> 
> /* Ensure this is a pointer to a copy not a pointer into a page table. */

Yes, makes sense.


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
  2025-06-30  9:19     ` David Hildenbrand
@ 2025-06-30 10:41       ` Lorenzo Stoakes
  2025-06-30 10:54         ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Lorenzo Stoakes @ 2025-06-30 10:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On Mon, Jun 30, 2025 at 11:19:13AM +0200, David Hildenbrand wrote:
> On 27.06.25 20:48, Lorenzo Stoakes wrote:
> > On Fri, Jun 27, 2025 at 01:55:09PM +0200, David Hildenbrand wrote:
> > > Many users (including upcoming ones) don't really need the flags etc,
> > > and can live with a function call.
> > >
> > > So let's provide a basic, non-inlined folio_pte_batch().
> >
> > Hm, but why non-inlined, when it invokes an inlined function? Seems odd no?
>
> We want to always generate a function that uses as little runtime checks as
> possible. Essentially, optimize out the "flags" as much as possible.
>
> In case of folio_pte_batch(), where we won't use any flags, any checks will
> be optimized out by the compiler.
>
> So we get a single, specialized, non-inlined function.

I mean I suppose code bloat is a thing too. Would the compiler not also optimise
out checks if it were inlined though?

>
> >
> > >
> > > In zap_present_ptes(), where we care about performance, the compiler
> > > already seem to generate a call to a common inlined folio_pte_batch()
> > > variant, shared with fork() code. So calling the new non-inlined variant
> > > should not make a difference.
> > >
> > > While at it, drop the "addr" parameter that is unused.
> > >
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> >
> > Other than the query above + nit on name below, this is really nice!
> >
> > > ---
> > >   mm/internal.h  | 11 ++++++++---
> > >   mm/madvise.c   |  4 ++--
> > >   mm/memory.c    |  6 ++----
> > >   mm/mempolicy.c |  3 +--
> > >   mm/mlock.c     |  3 +--
> > >   mm/mremap.c    |  3 +--
> > >   mm/rmap.c      |  3 +--
> > >   mm/util.c      | 29 +++++++++++++++++++++++++++++
> > >   8 files changed, 45 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index ca6590c6d9eab..6000b683f68ee 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -218,9 +218,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> > >   }
> > >
> > >   /**
> > > - * folio_pte_batch - detect a PTE batch for a large folio
> > > + * folio_pte_batch_ext - detect a PTE batch for a large folio
> > >    * @folio: The large folio to detect a PTE batch for.
> > > - * @addr: The user virtual address the first page is mapped at.
> > >    * @ptep: Page table pointer for the first entry.
> > >    * @pte: Page table entry for the first page.
> > >    * @max_nr: The maximum number of table entries to consider.
> > > @@ -243,9 +242,12 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> > >    * must be limited by the caller so scanning cannot exceed a single VMA and
> > >    * a single page table.
> > >    *
> > > + * This function will be inlined to optimize based on the input parameters;
> > > + * consider using folio_pte_batch() instead if applicable.
> > > + *
> > >    * Return: the number of table entries in the batch.
> > >    */
> > > -static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
> > > +static inline unsigned int folio_pte_batch_ext(struct folio *folio,
> > >   		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
> > >   		bool *any_writable, bool *any_young, bool *any_dirty)
> >
> > Sorry this is really really annoying feedback :P but _ext() makes me think of
> > page_ext and ugh :))
> >
> > Wonder if __folio_pte_batch() is better?
> >
> > This is obviously, not a big deal (TM)
>
> Obviously, I had that as part of the development, and decided against it at
> some point. :)
>
> Yeah, _ext() is not common in MM yet, in contrast to other subsystems. The
> only user is indeed page_ext. On arm we seem to have set_pte_ext(). But it's
> really "page_ext", that's the problematic part, not "_ext" :P
>
> No strong opinion, but I tend to dislike here "__", because often it means
> "internal helper you're not supposed to used", which isn't really the case
> here.

Yeah, and of course we break this convention all over the place :)

Maybe folio_pte_batch_flags()?

>
> E.g.,
>
> alloc_frozen_pages() -> alloc_frozen_pages_noprof() ->
> __alloc_frozen_pages_noprof()
>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
  2025-06-30 10:41       ` Lorenzo Stoakes
@ 2025-06-30 10:54         ` David Hildenbrand
  0 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-06-30 10:54 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On 30.06.25 12:41, Lorenzo Stoakes wrote:
> On Mon, Jun 30, 2025 at 11:19:13AM +0200, David Hildenbrand wrote:
>> On 27.06.25 20:48, Lorenzo Stoakes wrote:
>>> On Fri, Jun 27, 2025 at 01:55:09PM +0200, David Hildenbrand wrote:
>>>> Many users (including upcoming ones) don't really need the flags etc,
>>>> and can live with a function call.
>>>>
>>>> So let's provide a basic, non-inlined folio_pte_batch().
>>>
>>> Hm, but why non-inlined, when it invokes an inlined function? Seems odd no?
>>
>> We want to always generate a function that uses as little runtime checks as
>> possible. Essentially, optimize out the "flags" as much as possible.
>>
>> In case of folio_pte_batch(), where we won't use any flags, any checks will
>> be optimized out by the compiler.
>>
>> So we get a single, specialized, non-inlined function.
> 
> I mean I suppose code bloat is a thing too. Would the compiler not also optimise
> out checks if it were inlined though?

The compiler will optimize all (most) inlined variants, yes.

But before this change, we end up creating the same optimized variant for 
each folio_pte_batch() user.
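
Roughly, the split looks like this (just a sketch; the _ext prototype is the
one from the patch, the wrapper body and parameters are paraphrased):

  /* mm/internal.h: stays inlined, so constant flags get folded away */
  static inline unsigned int folio_pte_batch_ext(struct folio *folio,
		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
		bool *any_writable, bool *any_young, bool *any_dirty)
  {
	...
  }

  /* mm/util.c: one shared, already-specialized copy for the common case */
  unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
		unsigned int max_nr)
  {
	return folio_pte_batch_ext(folio, ptep, pte, max_nr, /* flags */ 0,
				   NULL, NULL, NULL);
  }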

And as Andrew put it

"And why the heck is folio_pte_batch() inlined?  It's larger then my 
first hard disk" [1]

I should probably add a suggested-by + link to that discussion.

[1] 
https://lore.kernel.org/linux-mm/20250503182858.5a02729fcffd6d4723afcfc2@linux-foundation.org/

[...]

>>
>> Obviously, I had that as part of the development, and decided against it at
>> some point. :)
>>
>> Yeah, _ext() is not common in MM yet, in contrast to other subsystems. The
>> only user is indeed page_ext. On arm we seem to have set_pte_ext(). But it's
>> really "page_ext", that's the problematic part, not "_ext" :P
>>
>> No strong opinion, but I tend to dislike here "__", because often it means
>> "internal helper you're not supposed to used", which isn't really the case
>> here.
> 
> Yeah, and of course we break this convention all over the place :)
> 
> Maybe folio_pte_batch_flags()?

Works for me as well.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-30  9:24               ` David Hildenbrand
@ 2025-06-30 10:57                 ` Ryan Roberts
  2025-06-30 11:01                   ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Ryan Roberts @ 2025-06-30 10:57 UTC (permalink / raw)
  To: David Hildenbrand, Dev Jain, linux-kernel
  Cc: linux-mm, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On 30/06/2025 10:24, David Hildenbrand wrote:
> On 30.06.25 11:18, Ryan Roberts wrote:
>> On 30/06/2025 10:08, David Hildenbrand wrote:
>>> On 30.06.25 11:04, Ryan Roberts wrote:
>>>> On 30/06/2025 04:34, Dev Jain wrote:
>>>>>
>>>>> On 29/06/25 2:30 am, David Hildenbrand wrote:
>>>>>> On 28.06.25 05:37, Dev Jain wrote:
>>>>>>>
>>>>>>> On 27/06/25 5:25 pm, David Hildenbrand wrote:
>>>>>>>> Honoring these PTE bits is the exception, so let's invert the meaning.
>>>>>>>>
>>>>>>>> With this change, most callers don't have to pass any flags.
>>>>>>>>
>>>>>>>> No functional change intended.
>>>>>>>
>>>>>>> FWIW I had proposed this kind of change earlier to Ryan (CCed) and
>>>>>>> he pointed out: "Doesn't that argument apply in reverse if you want
>>>>>>> to ignore something new in future?
>>>>>>>
>>>>>>> By default we are comparing all the bits in the pte when determining the
>>>>>>> batch.
>>>>>>> The flags request to ignore certain bits.
>>>>>>
>>>>>> That statement is not true: as default we ignore the write and young bit. And
>>>>>> we don't have flags for that ;)
>>>>>>
>>>>>>> Now we also ignore the dirty and soft-dirty bit as default, unless told not to
>>>>>> do that by one very specific caller.
>>>>>>
>>>>>>> If we want to ignore extra bits in
>>>>>>> future, we add new flags and the existing callers don't need to be updated.
>>>>>>
>>>>>> What stops you from using FPB_IGNORE_* for something else in the future?
>>>>>>
>>>>>> As a side note, there are not that many relevant PTE bits to worry about in
>>>>>> the near future ;)
>>>>>>
>>>>>> I mean, uffd-wp, sure, ... and before we add a FPB_HONOR_UFFD_WP to all users
>>>>>> to be safe (and changing the default to ignore), you could add a
>>>>>> FPB_IGNORE_UFFD_WP first, to then check who really can tolerate just ignoring
>>>>>> it (most of them, I assume).
>>>>> I agree.
>>>>
>>>> Meh. Personally I think if you start mixing HONOR and IGNORE flags, it becomes
>>>> very confusing to work out what is being checked for and what is not. I
>>>> stand by
>>>> my original view. But yeah, writable and young confuse it a bit... How about
>>>> generalizing by explicitly requiring IGNORE flags for write and young, then
>>>> also
>>>> create a flags macro for the common case?
>>>>
>>>> #define FPB_IGNORE_COMMON (FPB_IGNORE_WRITE | FPB_IGNORE_YOUNG |    \
>>>>                 FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY)
>>>>
>>>> It's not a hill I'm going to die on though...
>>>
>>> How about we make this function simpler, not more complicated? ;)
>>
>> I think here we both have different views of what is simpler... You are trying
>> to optimize for the callers writing less code. I'm trying to optimize for the
>> reader to be able to easily determine what the function will do for a given
>> caller.
> 
> See patch number #3: I want the default function -- folio_pte_batch() -- to not
> have any flags at all.
> 
> And I don't want to achieve that by internally using flags when calling
> folio_pte_batch_ext().
> 
> If you don't specify flags (folio_pte_batch()), behave just as if calling
> folio_pte_batch_ext() without flags. Anything else would be more confusing IMHO.

OK, sorry I don't have the context of your actual series... I was just trying to
defend the quote of mine that Dev sent. I'll go take a look at your series.

> 
> I agree that mixing HONOR and IGNORE is not a good idea. But then, it's really
> only uffd-wp that still could be batched, and likely we want it to be the
> default, and respect/honor/whatever instead in the cases where we really have to.
> 
> (If we really want to go down that path and batch it :) )

FWIW, I think we will need to honor the write bit for Dev's mprotect series (I
just sent review comments against his v4). So if you want most callers to just
call folio_pte_batch() and that will ignore the write bit, then I guess
folio_pte_batch_ext() will have to accept a FPB_HONOR_WRITE bit? Which I guess
aligns with what you are doing here to make all the flags HONOR.



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-30 10:57                 ` Ryan Roberts
@ 2025-06-30 11:01                   ` David Hildenbrand
  0 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-06-30 11:01 UTC (permalink / raw)
  To: Ryan Roberts, Dev Jain, linux-kernel
  Cc: linux-mm, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On 30.06.25 12:57, Ryan Roberts wrote:
> On 30/06/2025 10:24, David Hildenbrand wrote:
>> On 30.06.25 11:18, Ryan Roberts wrote:
>>> On 30/06/2025 10:08, David Hildenbrand wrote:
>>>> On 30.06.25 11:04, Ryan Roberts wrote:
>>>>> On 30/06/2025 04:34, Dev Jain wrote:
>>>>>>
>>>>>> On 29/06/25 2:30 am, David Hildenbrand wrote:
>>>>>>> On 28.06.25 05:37, Dev Jain wrote:
>>>>>>>>
>>>>>>>> On 27/06/25 5:25 pm, David Hildenbrand wrote:
>>>>>>>>> Honoring these PTE bits is the exception, so let's invert the meaning.
>>>>>>>>>
>>>>>>>>> With this change, most callers don't have to pass any flags.
>>>>>>>>>
>>>>>>>>> No functional change intended.
>>>>>>>>
>>>>>>>> FWIW I had proposed this kind of change earlier to Ryan (CCed) and
>>>>>>>> he pointed out: "Doesn't that argument apply in reverse if you want
>>>>>>>> to ignore something new in future?
>>>>>>>>
>>>>>>>> By default we are comparing all the bits in the pte when determining the
>>>>>>>> batch.
>>>>>>>> The flags request to ignore certain bits.
>>>>>>>
>>>>>>> That statement is not true: as default we ignore the write and young bit. And
>>>>>>> we don't have flags for that ;)
>>>>>>>
>>>>>>>> Now we also ignore the dirty and soft-dirty bit as default, unless told not to
>>>>>>> do that by one very specific caller.
>>>>>>>
>>>>>>>> If we want to ignore extra bits in
>>>>>>>> future, we add new flags and the existing callers don't need to be updated.
>>>>>>>
>>>>>>> What stops you from using FPB_IGNORE_* for something else in the future?
>>>>>>>
>>>>>>> As a side note, there are not that many relevant PTE bits to worry about in
>>>>>>> the near future ;)
>>>>>>>
>>>>>>> I mean, uffd-wp, sure, ... and before we add a FPB_HONOR_UFFD_WP to all users
>>>>>>> to be safe (and changing the default to ignore), you could add a
>>>>>>> FPB_IGNORE_UFFD_WP first, to then check who really can tolerate just ignoring
>>>>>>> it (most of them, I assume).
>>>>>> I agree.
>>>>>
>>>>> Meh. Personally I think if you start mixing HONOR and IGNORE flags, it becomes
>>>>> very confusing to work out what is being checked for and what is not. I
>>>>> stand by
>>>>> my original view. But yeah, writable and young confuse it a bit... How about
>>>>> generalizing by explicitly requiring IGNORE flags for write and young, then
>>>>> also
>>>>> create a flags macro for the common case?
>>>>>
>>>>> #define FPB_IGNORE_COMMON (FPB_IGNORE_WRITE | FPB_IGNORE_YOUNG |    \
>>>>>                  FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY)
>>>>>
>>>>> It's not a hill I'm going to die on though...
>>>>
>>>> How about we make this function simpler, not more complicated? ;)
>>>
>>> I think here we both have different views of what is simpler... You are trying
>>> to optimize for the callers writing less code. I'm trying to optimize for the
>>> reader to be able to easily determine what the function will do for a given
>>> caller.
>>
>> See patch number #3: I want the default function -- folio_pte_batch() -- to not
>> have any flags at all.
>>
>> And I don't want to achieve that by internally using flags when calling
>> folio_pte_batch_ext().
>>
>> If you don't specify flags (folio_pte_batch()), behave just as if calling
>> folio_pte_batch_ext() without flags. Anything else would be more confusing IMHO.
> 
> OK, sorry I don't have the context of your actual series... I was just trying to
> defend the quote of mine that Dev sent. I'll go take a look at your series.

If we'd not do the split and always provide flags + non-inlined variant, 
then I'd agree.

> 
>>
>> I agree that mixing HONOR and IGNORE is not a good idea. But then, it's really
>> only uffd-wp that still could be batched, and likely we want it to be the
>> default, and respect/honor/whatever instead in the cases where we really have to.
>>
>> (If we really want to go down that path and batch it :) )
> 
> FWIW, I think we will need to honor the write bit for Dev's mprotect series (I
> just sent review comments against his v4). So if you want most callers to just
> call folio_pte_batch() and that will ignore the write bit, then I guess
> folio_pte_batch_ext() will have to accept a FPB_HONOR_WRITE bit? Which I guess
> aligns with what you are doing here to make all the flags HONOR.

I have to take a look at that one. Maybe it's sufficient to know if any 
PTE was writable, just like we did with fork().

But yes, anybody who requires more advanced handling (flags) shall use 
the extended variant -- however that one will be called.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-06-30  9:32     ` David Hildenbrand
@ 2025-06-30 11:08       ` Lorenzo Stoakes
  2025-06-30 11:16         ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Lorenzo Stoakes @ 2025-06-30 11:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On Mon, Jun 30, 2025 at 11:32:40AM +0200, David Hildenbrand wrote:
> On 27.06.25 21:04, Lorenzo Stoakes wrote:
> > On Fri, Jun 27, 2025 at 01:55:10PM +0200, David Hildenbrand wrote:
> > > Instead, let's just allow for specifying through flags whether we want
> > > to have bits merged into the original PTE.
> > >
> > > For the madvise() case, simplify by having only a single parameter for
> > > merging young+dirty. For madvise_cold_or_pageout_pte_range() merging the
> > > dirty bit is not required, but also not harmful. This code is not that
> > > performance critical after all to really force all micro-optimizations.
> > >
> > > As we now have two pte_t * parameters, use PageTable() to make sure we
> > > are actually given a pointer at a copy of the PTE, not a pointer into
> > > an actual page table.
> > >
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> >
> > Overall a really nice cleanup! Just some comments below.
> >
> > > ---
> > >   mm/internal.h | 58 +++++++++++++++++++++++++++++++--------------------
> > >   mm/madvise.c  | 26 +++++------------------
> > >   mm/memory.c   |  8 ++-----
> > >   mm/util.c     |  2 +-
> > >   4 files changed, 43 insertions(+), 51 deletions(-)
> > >
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 6000b683f68ee..fe69e21b34a24 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -208,6 +208,18 @@ typedef int __bitwise fpb_t;
> > >   /* Compare PTEs honoring the soft-dirty bit. */
> > >   #define FPB_HONOR_SOFT_DIRTY		((__force fpb_t)BIT(1))
> > >
> > > +/*
> > > + * Merge PTE write bits: if any PTE in the batch is writable, modify the
> > > + * PTE at @ptentp to be writable.
> > > + */
> > > +#define FPB_MERGE_WRITE			((__force fpb_t)BIT(2))
> > > +
> > > +/*
> > > + * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty,
> > > + * modify the PTE at @ptentp to be young or dirty, respectively.
> > > + */
> > > +#define FPB_MERGE_YOUNG_DIRTY		((__force fpb_t)BIT(3))
> > > +
> > >   static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> > >   {
> > >   	if (!(flags & FPB_HONOR_DIRTY))
> > > @@ -220,16 +232,11 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> > >   /**
> > >    * folio_pte_batch_ext - detect a PTE batch for a large folio
> > >    * @folio: The large folio to detect a PTE batch for.
> > > + * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
> > >    * @ptep: Page table pointer for the first entry.
> > > - * @pte: Page table entry for the first page.
> > > + * @ptentp: Pointer at a copy of the first page table entry.
> >
> > This seems weird to me, I know it's a pointer to a copy of the PTE, essentially
> > replacing the pte param from before, but now it's also an output value?
> > Shouldn't this be made clear?
>
> As you spotted, I make that clear below and for each and every flag that
> someone would set that would affect it.

I still think this needs to be made clearer. As a reviewer I had no idea what on
earth this parameter was for honestly without having to think quite a bit (and I
try to avoid that these days :P).

And as a user of this function I'd be like 'weird it needs a copy'.

See below...

>
> >
> > I know it's a pain and churn but if this is now meant to be an output var we
> > should probably make it the last param too.
> >
> > At least needs an (output) or something here.
>
> Well, it's an input+output parameter.
>
> "Pointer at a copy of the first page table entry that might be modified
> depending on @flags." is a bit mouthful, but maybe clearer than just
> "output".

Yeah but even then it's not clear :)

So yeah it is both, and we get into vague realms here, actually normally
'output' means we never read it either... ugh god haha.

So maybe:

@ptentp: Pointer to a COPY of the first page table entry whose flags we update
         if appropriate.

And then update the description of folio_pte_batch_flags() both the brief one to
add 'updates ptentp to set flags if appropriate' and maybe in the larger
description bit.

Then we're as clear as we can be I think.

>
> [...]
>
> > >   	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
> > >   	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
> > >   	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
> > > +	VM_WARN_ON(virt_addr_valid(ptentp) && PageTable(virt_to_page(ptentp)));
> >
> > Hm so if !virt_addr_valid(ptentp) we're ok? :P
>
> I had the same question when writing that. Obviously,
> PageTable(virt_to_page(ptentp)) faulted when called on something on the
> stack. (ran into that ... :) )
>
> Maybe "VM_WARN_ON(virt_addr_valid(ptentp));" would work as well, but I am
> not sure how that function behaves on all architectures ...

Well you wouldn't want to limit callers to only working on stack values...

I guess just add a comment, or rather update the one below to something like:

	/*
	 * Ensure this is a pointer to a copy not a pointer into a page table.
	 * If this is a stack value, it won't be a valid virtual address, but that's
	 * fine because it also cannot be pointing into the page table.
	 */

Which would clarify.

>
> > I also think a quick comment here
> > would help, the commit message explains it but glancing at this I'd be confused.
> >
> > Something like:
> >
> > /* Ensure this is a pointer to a copy not a pointer into a page table. */
>
> Yes, makes sense.
>
>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-06-30 11:08       ` Lorenzo Stoakes
@ 2025-06-30 11:16         ` David Hildenbrand
  2025-06-30 11:18           ` Lorenzo Stoakes
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-06-30 11:16 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo


> 
> @ptentp: Pointer to a COPY of the first page table entry whose flags we update
>           if appropriate.

+ * @ptentp: Pointer to a COPY of the first page table entry whose flags this
+ *         function updates based on @flags if appropriate.


> 
> And then update the description of folio_pte_batch_flags() both the brief one to
> add 'updates ptentp to set flags if appropriate' and maybe in the larger
> description bit.

Not in the brief one; the other description, including the excessive parameter doc
will be enough.

FWIW, I briefly thought about passing in an arg struct, or returning the pte instead (and returning
the nr_pages using a parameter), but hated both. More than these two stupid pte*.

> 
>>
>> [...]
>>
>>>>    	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>>>    	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
>>>>    	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
>>>> +	VM_WARN_ON(virt_addr_valid(ptentp) && PageTable(virt_to_page(ptentp)));
>>>
>>> Hm so if !virt_addr_valid(ptentp) we're ok? :P
>>
>> I had the same question when writing that. Obviously,
>> PageTable(virt_to_page(ptentp)) faulted when called on something on the
>> stack. (ran into that ... :) )
>>
>> Maybe "VM_WARN_ON(virt_addr_valid(ptentp));" would work as well, but I am
>> not sure how that function behaves on all architectures ...
> 
> Well you wouldn't want to limit callers to only working on stack values...
> 
> I guess just add a comment, or rather update the one below to something like:
> 
> 	/*
> 	 * Ensure this is a pointer to a copy not a pointer into a page table.
> 	 * If this is a stack value, it won't be a valid virtual address, but that's
> 	 * fine because it also cannot be pointing into the page table.
> 	 */
> 
> Which would clarify.

Yes, I'll use that, thanks.
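
With your suggested comment on top, the check itself would then stay as in
the patch, i.e.:

	VM_WARN_ON(virt_addr_valid(ptentp) && PageTable(virt_to_page(ptentp)));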

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-06-30 11:16         ` David Hildenbrand
@ 2025-06-30 11:18           ` Lorenzo Stoakes
  2025-06-30 11:21             ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Lorenzo Stoakes @ 2025-06-30 11:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On Mon, Jun 30, 2025 at 01:16:52PM +0200, David Hildenbrand wrote:
>
> >
> > @ptentp: Pointer to a COPY of the first page table entry whose flags we update
> >           if appropriate.
>
> + * @ptentp: Pointer to a COPY of the first page table entry whose flags this
> + *         function updates based on @flags if appropriate.
>
>
> >
> > And then update the description of folio_pte_batch_flags() both the brief one to
> > add 'updates ptentp to set flags if appropriate' and maybe in the larger
> > description bit.
>
> Not in the brief one; the other description, including the excessive parameter doc
> will be enough.

That works for me! Let's not hold this up on trivia.

>
> FWIW, I briefly thought about passing in an arg struct, or returning the pte instead (and returning
> the nr_pages using a parameter), but hated both. More than these two stupid pte*.

Well I love help structs bro so you know I'd love that ;)

This version is fine with the updated param doc though!

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-06-30 11:18           ` Lorenzo Stoakes
@ 2025-06-30 11:21             ` David Hildenbrand
  0 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-06-30 11:21 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo

On 30.06.25 13:18, Lorenzo Stoakes wrote:
> On Mon, Jun 30, 2025 at 01:16:52PM +0200, David Hildenbrand wrote:
>>
>>>
>>> @ptentp: Pointer to a COPY of the first page table entry whose flags we update
>>>            if appropriate.
>>
>> + * @ptentp: Pointer to a COPY of the first page table entry whose flags this
>> + *         function updates based on @flags if appropriate.
>>
>>
>>>
>>> And then update the description of folio_pte_batch_flags() both the brief one to
>>> add 'updates ptentp to set flags if appropriate' and maybe in the larger
>>> description bit.
>>
>> Not in the brief one; the other description, including the excessive parameter doc
>> will be enough.
> 
> That works for me! Let's not hold this up on trivia.
> 
>>
>> FWIW, I briefly thought about passing in an arg struct, or returning the pte instead (and returning
>> the nr_pages using a parameter), but hated both. More than these two stupid pte*.
> 
> Well I love help structs bro so you know I'd love that ;)

Yeah, I was more annoyed by a possible folio_pte_batch() vs. 
folio_pte_batch_ext()/_flags() inconsistency.

I'll think about this once more ... :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-29  8:59         ` Mike Rapoport
@ 2025-06-30 13:47           ` David Hildenbrand
  0 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-06-30 13:47 UTC (permalink / raw)
  To: Mike Rapoport, Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Pedro Falcato,
	Rik van Riel, Harry Yoo

On 29.06.25 10:59, Mike Rapoport wrote:
> On Fri, Jun 27, 2025 at 05:33:06PM +0100, Lorenzo Stoakes wrote:
>> On Fri, Jun 27, 2025 at 06:30:13PM +0200, David Hildenbrand wrote:
>>> On 27.06.25 18:28, Lorenzo Stoakes wrote:
>>>> On Fri, Jun 27, 2025 at 01:55:07PM +0200, David Hildenbrand wrote:
>>>>> Honoring these PTE bits is the exception, so let's invert the meaning.
>>>>>
>>>>> With this change, most callers don't have to pass any flags.
>>>>>
>>>>> No functional change intended.
>>>>>
>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>
>>>> This is a nice change, it removes a lot of code I really didn't enjoy
>>>> looking at for introducing these flags all over the place.
>>>>
>>>> But a nit on the naming below, I'm not a fan of 'honor' here :)
>>>>
>>>>> ---
>>>>>    mm/internal.h  | 16 ++++++++--------
>>>>>    mm/madvise.c   |  3 +--
>>>>>    mm/memory.c    | 11 +++++------
>>>>>    mm/mempolicy.c |  4 +---
>>>>>    mm/mlock.c     |  3 +--
>>>>>    mm/mremap.c    |  3 +--
>>>>>    mm/rmap.c      |  3 +--
>>>>>    7 files changed, 18 insertions(+), 25 deletions(-)
>>>>>
>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>> index e84217e27778d..9690c75063881 100644
>>>>> --- a/mm/internal.h
>>>>> +++ b/mm/internal.h
>>>>> @@ -202,17 +202,17 @@ static inline void vma_close(struct vm_area_struct *vma)
>>>>>    /* Flags for folio_pte_batch(). */
>>>>>    typedef int __bitwise fpb_t;
>>>>>
>>>>> -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
>>>>> -#define FPB_IGNORE_DIRTY		((__force fpb_t)BIT(0))
>>>>> +/* Compare PTEs honoring the dirty bit. */
>>>>> +#define FPB_HONOR_DIRTY		((__force fpb_t)BIT(0))
>>>>
>>>> Hm not to be petty but... :)
>>>>
>>>> I'm not sure I find 'honor' very clear here. Ignore is very clear, 'honor' (God
>>>> the British English in me wants to say honour here but stipp :P) doesn't
>>>> necessarily tell you what is going to happen.
>>>>
>>>> Perhaps PROPAGATE? or OBEY?
>>>
>>> RESPECT? :)
> 
> DONT_IGNORE ;-)

Well, yes, that would be an obvious option as well, but I think I'll go 
with "RESPECT" for now.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-27 11:55 ` [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_* David Hildenbrand
                     ` (2 preceding siblings ...)
  2025-06-28  3:37   ` Dev Jain
@ 2025-06-30 14:35   ` Zi Yan
  2025-07-02  8:31   ` Oscar Salvador
  4 siblings, 0 replies; 58+ messages in thread
From: Zi Yan @ 2025-06-30 14:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On 27 Jun 2025, at 7:55, David Hildenbrand wrote:

> Honoring these PTE bits is the exception, so let's invert the meaning.
>
> With this change, most callers don't have to pass any flags.
>
> No functional change intended.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/internal.h  | 16 ++++++++--------
>  mm/madvise.c   |  3 +--
>  mm/memory.c    | 11 +++++------
>  mm/mempolicy.c |  4 +---
>  mm/mlock.c     |  3 +--
>  mm/mremap.c    |  3 +--
>  mm/rmap.c      |  3 +--
>  7 files changed, 18 insertions(+), 25 deletions(-)
>

LGTM. I assume the callers that no longer pass any flags will be converted
to use folio_pte_batch() (no flags) later. I will keep reading.

Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements
  2025-06-27 11:55 ` [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements David Hildenbrand
  2025-06-27 13:58   ` Lance Yang
  2025-06-27 16:51   ` Lorenzo Stoakes
@ 2025-06-30 17:40   ` Zi Yan
  2025-07-02  8:42   ` Oscar Salvador
  3 siblings, 0 replies; 58+ messages in thread
From: Zi Yan @ 2025-06-30 17:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On 27 Jun 2025, at 7:55, David Hildenbrand wrote:

> Let's clean up a bit:
>
> (1) No need for start_ptep vs. ptep anymore, we can simply use ptep
>
> (2) Let's switch to "unsigned int" for everything
>
> (3) We can simplify the code by leaving the pte unchanged after the
>     pte_same() check.
>
> (4) Clarify that we should never exceed a single VMA; it indicates a
>     problem in the caller.
>
> No functional change intended.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/internal.h | 37 +++++++++++++++----------------------
>  1 file changed, 15 insertions(+), 22 deletions(-)
>

LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
  2025-06-27 11:55 ` [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext() David Hildenbrand
  2025-06-27 14:19   ` Lance Yang
  2025-06-27 18:48   ` Lorenzo Stoakes
@ 2025-06-30 17:45   ` Zi Yan
  2025-07-02  9:02   ` Oscar Salvador
  2025-07-02  9:09   ` Oscar Salvador
  4 siblings, 0 replies; 58+ messages in thread
From: Zi Yan @ 2025-06-30 17:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On 27 Jun 2025, at 7:55, David Hildenbrand wrote:

> Many users (including upcoming ones) don't really need the flags etc,
> and can live with a function call.
>
> So let's provide a basic, non-inlined folio_pte_batch().
>
> In zap_present_ptes(), where we care about performance, the compiler
> already seem to generate a call to a common inlined folio_pte_batch()
> variant, shared with fork() code. So calling the new non-inlined variant
> should not make a difference.
>
> While at it, drop the "addr" parameter that is unused.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/internal.h  | 11 ++++++++---
>  mm/madvise.c   |  4 ++--
>  mm/memory.c    |  6 ++----
>  mm/mempolicy.c |  3 +--
>  mm/mlock.c     |  3 +--
>  mm/mremap.c    |  3 +--
>  mm/rmap.c      |  3 +--
>  mm/util.c      | 29 +++++++++++++++++++++++++++++
>  8 files changed, 45 insertions(+), 17 deletions(-)
>
Nice cleanup. Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-06-27 11:55 ` [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext() David Hildenbrand
  2025-06-27 14:34   ` Lance Yang
  2025-06-27 19:04   ` Lorenzo Stoakes
@ 2025-06-30 17:59   ` Zi Yan
  2025-07-02  9:08     ` David Hildenbrand
  2 siblings, 1 reply; 58+ messages in thread
From: Zi Yan @ 2025-06-30 17:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On 27 Jun 2025, at 7:55, David Hildenbrand wrote:

> Instead, let's just allow for specifying through flags whether we want
> to have bits merged into the original PTE.
>
> For the madvise() case, simplify by having only a single parameter for
> merging young+dirty. For madvise_cold_or_pageout_pte_range() merging the
> dirty bit is not required, but also not harmful. This code is not that
> performance critical after all to really force all micro-optimizations.
>
> As we now have two pte_t * parameters, use PageTable() to make sure we
> are actually given a pointer at a copy of the PTE, not a pointer into
> an actual page table.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/internal.h | 58 +++++++++++++++++++++++++++++++--------------------
>  mm/madvise.c  | 26 +++++------------------
>  mm/memory.c   |  8 ++-----
>  mm/util.c     |  2 +-
>  4 files changed, 43 insertions(+), 51 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 6000b683f68ee..fe69e21b34a24 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -208,6 +208,18 @@ typedef int __bitwise fpb_t;
>  /* Compare PTEs honoring the soft-dirty bit. */
>  #define FPB_HONOR_SOFT_DIRTY		((__force fpb_t)BIT(1))
>
> +/*
> + * Merge PTE write bits: if any PTE in the batch is writable, modify the
> + * PTE at @ptentp to be writable.
> + */
> +#define FPB_MERGE_WRITE			((__force fpb_t)BIT(2))
> +
> +/*
> + * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty,
> + * modify the PTE at @ptentp to be young or dirty, respectively.
> + */
> +#define FPB_MERGE_YOUNG_DIRTY		((__force fpb_t)BIT(3))
> +
>  static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  {
>  	if (!(flags & FPB_HONOR_DIRTY))
> @@ -220,16 +232,11 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  /**
>   * folio_pte_batch_ext - detect a PTE batch for a large folio
>   * @folio: The large folio to detect a PTE batch for.
> + * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
>   * @ptep: Page table pointer for the first entry.
> - * @pte: Page table entry for the first page.
> + * @ptentp: Pointer at a copy of the first page table entry.
>   * @max_nr: The maximum number of table entries to consider.
>   * @flags: Flags to modify the PTE batch semantics.
> - * @any_writable: Optional pointer to indicate whether any entry except the
> - *		  first one is writable.
> - * @any_young: Optional pointer to indicate whether any entry except the
> - *		  first one is young.
> - * @any_dirty: Optional pointer to indicate whether any entry except the
> - *		  first one is dirty.
>   *
>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>   * pages of the same large folio in a single VMA and a single page table.
> @@ -242,28 +249,26 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   * must be limited by the caller so scanning cannot exceed a single VMA and
>   * a single page table.
>   *
> + * Depending on the FPB_MERGE_* flags, the pte stored at @ptentp will
> + * be modified.
> + *
>   * This function will be inlined to optimize based on the input parameters;
>   * consider using folio_pte_batch() instead if applicable.
>   *
>   * Return: the number of table entries in the batch.
>   */
>  static inline unsigned int folio_pte_batch_ext(struct folio *folio,
> -		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
> -		bool *any_writable, bool *any_young, bool *any_dirty)
> +		struct vm_area_struct *vma, pte_t *ptep, pte_t *ptentp,
> +		unsigned int max_nr, fpb_t flags)
>  {
> +	bool any_writable = false, any_young = false, any_dirty = false;
> +	pte_t expected_pte, pte = *ptentp;
>  	unsigned int nr, cur_nr;
> -	pte_t expected_pte;
> -
> -	if (any_writable)
> -		*any_writable = false;
> -	if (any_young)
> -		*any_young = false;
> -	if (any_dirty)
> -		*any_dirty = false;
>
>  	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>  	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
>  	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
> +	VM_WARN_ON(virt_addr_valid(ptentp) && PageTable(virt_to_page(ptentp)));

Why not just VM_WARN_ON(!pte_same(*ptep, *ptentp)) ?

ptep points to the first page table entry and ptentp points to the copy of it.
I assume ptep should point to a valid page table entry to begin with.

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_*
  2025-06-27 11:55 ` [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_* David Hildenbrand
                     ` (3 preceding siblings ...)
  2025-06-30 14:35   ` Zi Yan
@ 2025-07-02  8:31   ` Oscar Salvador
  4 siblings, 0 replies; 58+ messages in thread
From: Oscar Salvador @ 2025-07-02  8:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 01:55:07PM +0200, David Hildenbrand wrote:
> Honoring these PTE bits is the exception, so let's invert the meaning.
> 
> With this change, most callers don't have to pass any flags.
> 
> No functional change intended.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

I'm not gonna jump into a dialectics battle :-), so whatever ends up being the
FPB_flag:

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements
  2025-06-27 11:55 ` [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements David Hildenbrand
                     ` (2 preceding siblings ...)
  2025-06-30 17:40   ` Zi Yan
@ 2025-07-02  8:42   ` Oscar Salvador
  2025-07-02  8:48     ` David Hildenbrand
  3 siblings, 1 reply; 58+ messages in thread
From: Oscar Salvador @ 2025-07-02  8:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 01:55:08PM +0200, David Hildenbrand wrote:
> Let's clean up a bit:
> 
> (1) No need for start_ptep vs. ptep anymore, we can simply use ptep
> 
> (2) Let's switch to "unsigned int" for everything
> 
> (3) We can simplify the code by leaving the pte unchanged after the
>     pte_same() check.
> 
> (4) Clarify that we should never exceed a single VMA; it indicates a
>     problem in the caller.
> 
> No functional change intended.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Hi David :-),

I have to confess that I fell into the same trap as Lorenzo wrt.
__pte_batch_clear_ignored changing the pte value.
So I'm not sure if it would be nice to place a little comment in
__pte_batch_clear_ignored clarifying that pte's value remains unchanged?

Reviewed-by: Oscar Salvador <osalvador@suse.de>

 
-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements
  2025-07-02  8:42   ` Oscar Salvador
@ 2025-07-02  8:48     ` David Hildenbrand
  2025-07-02  8:51       ` Lorenzo Stoakes
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-07-02  8:48 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On 02.07.25 10:42, Oscar Salvador wrote:
> On Fri, Jun 27, 2025 at 01:55:08PM +0200, David Hildenbrand wrote:
>> Let's clean up a bit:
>>
>> (1) No need for start_ptep vs. ptep anymore, we can simply use ptep
>>
>> (2) Let's switch to "unsigned int" for everything
>>
>> (3) We can simplify the code by leaving the pte unchanged after the
>>      pte_same() check.
>>
>> (4) Clarify that we should never exceed a single VMA; it indicates a
>>      problem in the caller.
>>
>> No functional change intended.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Hi David :-),
> 
> I have to confess that I fell in the same trap as Lorenzo wrt.
> __pte_batch_clear_ignored changing the pte value.
> So I'm not sure if it would be nice to place a little comment in
> __pte_batch_clear_ignored clarifying that pte's value remains unchanged?

I mean, that's how all our pte modification functions work, really? :)

Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements
  2025-07-02  8:48     ` David Hildenbrand
@ 2025-07-02  8:51       ` Lorenzo Stoakes
  2025-07-02  9:00         ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Lorenzo Stoakes @ 2025-07-02  8:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, linux-kernel, linux-mm, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On Wed, Jul 02, 2025 at 10:48:20AM +0200, David Hildenbrand wrote:
> On 02.07.25 10:42, Oscar Salvador wrote:
> > On Fri, Jun 27, 2025 at 01:55:08PM +0200, David Hildenbrand wrote:
> > > Let's clean up a bit:
> > >
> > > (1) No need for start_ptep vs. ptep anymore, we can simply use ptep
> > >
> > > (2) Let's switch to "unsigned int" for everything
> > >
> > > (3) We can simplify the code by leaving the pte unchanged after the
> > >      pte_same() check.
> > >
> > > (4) Clarify that we should never exceed a single VMA; it indicates a
> > >      problem in the caller.
> > >
> > > No functional change intended.
> > >
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> >
> > Hi David :-),
> >
> > I have to confess that I fell in the same trap as Lorenzo wrt.
> > __pte_batch_clear_ignored changing the pte value.
> > So I'm not sure if it would be nice to place a little comment in
> > __pte_batch_clear_ignored clarifying that pte's value remains unchanged?
>
> I mean, that's how all our pte modification functions work, really? :)
>
> Thanks!

I mean, it might be that me and Oscar are similarly 'challenged' in this
respect :P (high 5 Oscar!) but I think the issue here is that it's sort of
a compounded use, and in fact some functions do modify stuff, which is why
we end up with all the ptep ptent etc. fun.

Up to you re: comment, but I think maybe in cases where it's a really
compounded set of stuff it's potentially useful.

But obviously we still do do this all over the place elsewhere with no
comment...

>
> --
> Cheers,
>
> David / dhildenb
>
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements
  2025-07-02  8:51       ` Lorenzo Stoakes
@ 2025-07-02  9:00         ` David Hildenbrand
  2025-07-02  9:08           ` Lorenzo Stoakes
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-07-02  9:00 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Oscar Salvador, linux-kernel, linux-mm, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On 02.07.25 10:51, Lorenzo Stoakes wrote:
> On Wed, Jul 02, 2025 at 10:48:20AM +0200, David Hildenbrand wrote:
>> On 02.07.25 10:42, Oscar Salvador wrote:
>>> On Fri, Jun 27, 2025 at 01:55:08PM +0200, David Hildenbrand wrote:
>>>> Let's clean up a bit:
>>>>
>>>> (1) No need for start_ptep vs. ptep anymore, we can simply use ptep
>>>>
>>>> (2) Let's switch to "unsigned int" for everything
>>>>
>>>> (3) We can simplify the code by leaving the pte unchanged after the
>>>>       pte_same() check.
>>>>
>>>> (4) Clarify that we should never exceed a single VMA; it indicates a
>>>>       problem in the caller.
>>>>
>>>> No functional change intended.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>> Hi David :-),
>>>
>>> I have to confess that I fell in the same trap as Lorenzo wrt.
>>> __pte_batch_clear_ignored changing the pte value.
>>> So I'm not sure if it would be nice to place a little comment in
>>> __pte_batch_clear_ignored clarifying that pte's value remains unchanged?
>>
>> I mean, that's how all our pte modification functions work, really? :)
>>
>> Thanks!
> 
> I mean, it might be that me and Oscar are similarly 'challenged' in this
> respect :P (high 5 Oscar!) but I think the issue here is that it's sort of
> a compounded use, and in fact some functions do modify stuff, which is why
> we end up with all the ptep ptent etc. fun.
> 
> Up to you re: comment, but I think maybe in cases where it's a reallly
> compounded set of stuff it's potentially useful.
> 
> But obviously we still do do this all over the place elsewhere with no
> comment...

Well, if you are passing in a *value* and not a pointer to a 
function, you would not expect that *value* to change? :)

Yes, once we pass pointers it's different. Or when we're using weird macros.

Adding a comment that a function will not modify a value that is ... 
passed-by-value? Maybe it's just me that doesn't get why that should be 
particularly helpful :)
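
Just to spell out the value semantics in question (using the helper as-is):

	pte_t tmp = __pte_batch_clear_ignored(pte, flags);

"pte" itself is untouched; only "tmp" has the ignored bits cleared, and
nothing changes unless you explicitly assign the result back.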

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
  2025-06-27 11:55 ` [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext() David Hildenbrand
                     ` (2 preceding siblings ...)
  2025-06-30 17:45   ` Zi Yan
@ 2025-07-02  9:02   ` Oscar Salvador
  2025-07-02  9:05     ` David Hildenbrand
  2025-07-02  9:09   ` Oscar Salvador
  4 siblings, 1 reply; 58+ messages in thread
From: Oscar Salvador @ 2025-07-02  9:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 01:55:09PM +0200, David Hildenbrand wrote:
> Many users (including upcoming ones) don't really need the flags etc,
> and can live with a function call.
> 
> So let's provide a basic, non-inlined folio_pte_batch().
> 
> In zap_present_ptes(), where we care about performance, the compiler
> already seems to generate a call to a common inlined folio_pte_batch()
> variant, shared with fork() code. So calling the new non-inlined variant
> should not make a difference.
> 
> While at it, drop the "addr" parameter that is unused.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

So, let me see if I get this.

Before this, every other user that doesn't use the extra flags (dirty,
etc...) will end up with the code, with the flag checks optimized out,
inlined within its body?

With this change, a single function, folio_pte_batch(), identical to
folio_pte_batch_ext() but without the runtime checks for those arguments,
will be created, and so its users won't have it inlined in their body?


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
  2025-07-02  9:02   ` Oscar Salvador
@ 2025-07-02  9:05     ` David Hildenbrand
  2025-07-02  9:07       ` Oscar Salvador
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-07-02  9:05 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On 02.07.25 11:02, Oscar Salvador wrote:
> On Fri, Jun 27, 2025 at 01:55:09PM +0200, David Hildenbrand wrote:
>> Many users (including upcoming ones) don't really need the flags etc,
>> and can live with a function call.
>>
>> So let's provide a basic, non-inlined folio_pte_batch().
>>
>> In zap_present_ptes(), where we care about performance, the compiler
>> already seem to generate a call to a common inlined folio_pte_batch()
>> variant, shared with fork() code. So calling the new non-inlined variant
>> should not make a difference.
>>
>> While at it, drop the "addr" parameter that is unused.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> So, let me see if I get this.
> 
> Before this, every other user that doesn't use the extra flags (dirty,
> etc...) will end up with the code, optimized out, inlined within its body?

Not necessarily inlined into the body (there might still be a function 
call, depending on what the compiler decides), but inlined into the 
object file and optimized by propagating constants.

> 
> With this change, a single function, folio_pte_batch(), identical to folio_pte_batch_ext
> but without the runtime checks for those arguments will be created (folio_pte_batch()),
> and so the users of it won't have it inlined in their body ?

Right. We have a single folio_pte_batch() that is optimized by 
propagating all constants. Instead of having one per object file, we 
have a single shared one.
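
Put differently: inside the shared folio_pte_batch(), "flags" is a
compile-time constant (no flags set), so a check like

	if (!(flags & FPB_HONOR_DIRTY))
		pte = pte_mkclean(pte);

in the inlined helpers compiles down to an unconditional pte_mkclean();
the runtime flag check simply disappears.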

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
  2025-07-02  9:05     ` David Hildenbrand
@ 2025-07-02  9:07       ` Oscar Salvador
  2025-07-02  9:11         ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Oscar Salvador @ 2025-07-02  9:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On Wed, Jul 02, 2025 at 11:05:17AM +0200, David Hildenbrand wrote:
> Not necessarily inlined into the body (there might still be a function call,
> depending on what the compiler decides), but inlined into the object file
> and optimized by propagating constants.

I see.

> > With this change, a single function, folio_pte_batch(), identical to folio_pte_batch_ext
> > but without the runtime checks for those arguments will be created (folio_pte_batch()),
> > and so the users of it won't have it inlined in their body ?
> 
> Right. We have a single folio_pte_batch() that is optimized by propagating
> all constants. Instead of having one per object file, we have a single
> shared one.

Alright, clear to me now, thanks for clarifying ;-)!

 

-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements
  2025-07-02  9:00         ` David Hildenbrand
@ 2025-07-02  9:08           ` Lorenzo Stoakes
  2025-07-02  9:11             ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: Lorenzo Stoakes @ 2025-07-02  9:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, linux-kernel, linux-mm, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On Wed, Jul 02, 2025 at 11:00:48AM +0200, David Hildenbrand wrote:
> On 02.07.25 10:51, Lorenzo Stoakes wrote:
> > On Wed, Jul 02, 2025 at 10:48:20AM +0200, David Hildenbrand wrote:
> > > On 02.07.25 10:42, Oscar Salvador wrote:
> > > > On Fri, Jun 27, 2025 at 01:55:08PM +0200, David Hildenbrand wrote:
> > > > > Let's clean up a bit:
> > > > >
> > > > > (1) No need for start_ptep vs. ptep anymore, we can simply use ptep
> > > > >
> > > > > (2) Let's switch to "unsigned int" for everything
> > > > >
> > > > > (3) We can simplify the code by leaving the pte unchanged after the
> > > > >       pte_same() check.
> > > > >
> > > > > (4) Clarify that we should never exceed a single VMA; it indicates a
> > > > >       problem in the caller.
> > > > >
> > > > > No functional change intended.
> > > > >
> > > > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > >
> > > > Hi David :-),
> > > >
> > > > I have to confess that I fell in the same trap as Lorenzo wrt.
> > > > __pte_batch_clear_ignored changing the pte value.
> > > > So I'm not sure if it would be nice to place a little comment in
> > > > __pte_batch_clear_ignored clarifying that pte's value remains unchanged?
> > >
> > > I mean, that's how all our pte modification functions work, really? :)
> > >
> > > Thanks!
> >
> > I mean, it might be that me and Oscar are similarly 'challenged' in this
> > respect :P (high 5 Oscar!) but I think the issue here is that it's sort of
> > a compounded use, and in fact some functions do modify stuff, which is why
> > we end up with all the ptep ptent etc. fun.
> >
> > Up to you re: comment, but I think maybe in cases where it's a really
> > compounded set of stuff it's potentially useful.
> >
> > But obviously we still do do this all over the place elsewhere with no
> > comment...
>
> Well, if you are passing in a *value* and not a pointer to a function,
> you would not expect that *value* to change? :)
>
> Yes, once we pass pointers it's different. Or when we're using weird macros.
>
> Adding a comment that a function will not modify a value that is ...
> passed-by-value? Maybe it's just me that doesn't get why that should be
> particularly helpful :)

I think the issue is that we've passed around 'pte' as value and pointer (and of
course, via macros...) previously, so that's often the cause of the confusion.

This is why I really am a fan of us consistently saying ptep when passing a
pointer.

Anyway, I think on balance a comment isn't useful here, agreed! ;)

>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-06-30 17:59   ` Zi Yan
@ 2025-07-02  9:08     ` David Hildenbrand
  2025-07-02  9:09       ` David Hildenbrand
  0 siblings, 1 reply; 58+ messages in thread
From: David Hildenbrand @ 2025-07-02  9:08 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On 30.06.25 19:59, Zi Yan wrote:
> On 27 Jun 2025, at 7:55, David Hildenbrand wrote:
> 
>> Instead, let's just allow for specifying through flags whether we want
>> to have bits merged into the original PTE.
>>
>> For the madvise() case, simplify by having only a single parameter for
>> merging young+dirty. For madvise_cold_or_pageout_pte_range() merging the
>> dirty bit is not required, but also not harmful. This code is not that
>> performance critical after all to really force all micro-optimizations.
>>
>> As we now have two pte_t * parameters, use PageTable() to make sure we
>> are actually given a pointer at a copy of the PTE, not a pointer into
>> an actual page table.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/internal.h | 58 +++++++++++++++++++++++++++++++--------------------
>>   mm/madvise.c  | 26 +++++------------------
>>   mm/memory.c   |  8 ++-----
>>   mm/util.c     |  2 +-
>>   4 files changed, 43 insertions(+), 51 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 6000b683f68ee..fe69e21b34a24 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -208,6 +208,18 @@ typedef int __bitwise fpb_t;
>>   /* Compare PTEs honoring the soft-dirty bit. */
>>   #define FPB_HONOR_SOFT_DIRTY		((__force fpb_t)BIT(1))
>>
>> +/*
>> + * Merge PTE write bits: if any PTE in the batch is writable, modify the
>> + * PTE at @ptentp to be writable.
>> + */
>> +#define FPB_MERGE_WRITE			((__force fpb_t)BIT(2))
>> +
>> +/*
>> + * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty,
>> + * modify the PTE at @ptentp to be young or dirty, respectively.
>> + */
>> +#define FPB_MERGE_YOUNG_DIRTY		((__force fpb_t)BIT(3))
>> +
>>   static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>   {
>>   	if (!(flags & FPB_HONOR_DIRTY))
>> @@ -220,16 +232,11 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>   /**
>>    * folio_pte_batch_ext - detect a PTE batch for a large folio
>>    * @folio: The large folio to detect a PTE batch for.
>> + * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
>>    * @ptep: Page table pointer for the first entry.
>> - * @pte: Page table entry for the first page.
>> + * @ptentp: Pointer at a copy of the first page table entry.
>>    * @max_nr: The maximum number of table entries to consider.
>>    * @flags: Flags to modify the PTE batch semantics.
>> - * @any_writable: Optional pointer to indicate whether any entry except the
>> - *		  first one is writable.
>> - * @any_young: Optional pointer to indicate whether any entry except the
>> - *		  first one is young.
>> - * @any_dirty: Optional pointer to indicate whether any entry except the
>> - *		  first one is dirty.
>>    *
>>    * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>>    * pages of the same large folio in a single VMA and a single page table.
>> @@ -242,28 +249,26 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>    * must be limited by the caller so scanning cannot exceed a single VMA and
>>    * a single page table.
>>    *
>> + * Depending on the FPB_MERGE_* flags, the pte stored at @ptentp will
>> + * be modified.
>> + *
>>    * This function will be inlined to optimize based on the input parameters;
>>    * consider using folio_pte_batch() instead if applicable.
>>    *
>>    * Return: the number of table entries in the batch.
>>    */
>>   static inline unsigned int folio_pte_batch_ext(struct folio *folio,
>> -		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
>> -		bool *any_writable, bool *any_young, bool *any_dirty)
>> +		struct vm_area_struct *vma, pte_t *ptep, pte_t *ptentp,
>> +		unsigned int max_nr, fpb_t flags)
>>   {
>> +	bool any_writable = false, any_young = false, any_dirty = false;
>> +	pte_t expected_pte, pte = *ptentp;
>>   	unsigned int nr, cur_nr;
>> -	pte_t expected_pte;
>> -
>> -	if (any_writable)
>> -		*any_writable = false;
>> -	if (any_young)
>> -		*any_young = false;
>> -	if (any_dirty)
>> -		*any_dirty = false;
>>
>>   	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>   	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
>>   	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
>> +	VM_WARN_ON(virt_addr_valid(ptentp) && PageTable(virt_to_page(ptentp)));
> 
> Why not just VM_WARN_ON(!pte_same(*ptep, *ptentp)) ?
> 
> ptep points to the first page table entry and ptentp points to the copy of it.
> I assume ptep should point to a valid page table entry to begin with.

That would also work; it would only miss the case where someone accidentally
flipped the two pointers :)
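
To make the new flags a bit more concrete, here is a rough sketch of a caller
(batch_example() is made up, just modeled on the madvise() change described
above; not the exact code from the series):

	/* Fold young/dirty bits from the whole batch into our local copy. */
	static unsigned int batch_example(struct folio *folio,
			struct vm_area_struct *vma, pte_t *ptep,
			unsigned int max_nr)
	{
		pte_t ptent = ptep_get(ptep);
		unsigned int nr;

		nr = folio_pte_batch_ext(folio, vma, ptep, &ptent, max_nr,
					 FPB_MERGE_YOUNG_DIRTY);

		/* ptent is now young/dirty if any PTE in the batch was. */
		if (pte_young(ptent))
			folio_mark_accessed(folio);
		return nr;
	}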

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
  2025-06-27 11:55 ` [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext() David Hildenbrand
                     ` (3 preceding siblings ...)
  2025-07-02  9:02   ` Oscar Salvador
@ 2025-07-02  9:09   ` Oscar Salvador
  4 siblings, 0 replies; 58+ messages in thread
From: Oscar Salvador @ 2025-07-02  9:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On Fri, Jun 27, 2025 at 01:55:09PM +0200, David Hildenbrand wrote:
> Many users (including upcoming ones) don't really need the flags etc,
> and can live with a function call.
> 
> So let's provide a basic, non-inlined folio_pte_batch().
> 
> In zap_present_ptes(), where we care about performance, the compiler
> already seems to generate a call to a common inlined folio_pte_batch()
> variant, shared with fork() code. So calling the new non-inlined variant
> should not make a difference.
> 
> While at it, drop the "addr" parameter that is unused.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

FWIW, folio_pte_batch_flags seems more appealing to me as well.

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-07-02  9:08     ` David Hildenbrand
@ 2025-07-02  9:09       ` David Hildenbrand
  0 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-07-02  9:09 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On 02.07.25 11:08, David Hildenbrand wrote:
> On 30.06.25 19:59, Zi Yan wrote:
>> On 27 Jun 2025, at 7:55, David Hildenbrand wrote:
>>
>>> Instead, let's just allow for specifying through flags whether we want
>>> to have bits merged into the original PTE.
>>>
>>> For the madvise() case, simplify by having only a single parameter for
>>> merging young+dirty. For madvise_cold_or_pageout_pte_range() merging the
>>> dirty bit is not required, but also not harmful. This code is not that
>>> performance critical after all to really force all micro-optimizations.
>>>
>>> As we now have two pte_t * parameters, use PageTable() to make sure we
>>> are actually given a pointer at a copy of the PTE, not a pointer into
>>> an actual page table.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>    mm/internal.h | 58 +++++++++++++++++++++++++++++++--------------------
>>>    mm/madvise.c  | 26 +++++------------------
>>>    mm/memory.c   |  8 ++-----
>>>    mm/util.c     |  2 +-
>>>    4 files changed, 43 insertions(+), 51 deletions(-)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 6000b683f68ee..fe69e21b34a24 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -208,6 +208,18 @@ typedef int __bitwise fpb_t;
>>>    /* Compare PTEs honoring the soft-dirty bit. */
>>>    #define FPB_HONOR_SOFT_DIRTY		((__force fpb_t)BIT(1))
>>>
>>> +/*
>>> + * Merge PTE write bits: if any PTE in the batch is writable, modify the
>>> + * PTE at @ptentp to be writable.
>>> + */
>>> +#define FPB_MERGE_WRITE			((__force fpb_t)BIT(2))
>>> +
>>> +/*
>>> + * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty,
>>> + * modify the PTE at @ptentp to be young or dirty, respectively.
>>> + */
>>> +#define FPB_MERGE_YOUNG_DIRTY		((__force fpb_t)BIT(3))
>>> +
>>>    static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>>    {
>>>    	if (!(flags & FPB_HONOR_DIRTY))
>>> @@ -220,16 +232,11 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>>    /**
>>>     * folio_pte_batch_ext - detect a PTE batch for a large folio
>>>     * @folio: The large folio to detect a PTE batch for.
>>> + * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
>>>     * @ptep: Page table pointer for the first entry.
>>> - * @pte: Page table entry for the first page.
>>> + * @ptentp: Pointer at a copy of the first page table entry.
>>>     * @max_nr: The maximum number of table entries to consider.
>>>     * @flags: Flags to modify the PTE batch semantics.
>>> - * @any_writable: Optional pointer to indicate whether any entry except the
>>> - *		  first one is writable.
>>> - * @any_young: Optional pointer to indicate whether any entry except the
>>> - *		  first one is young.
>>> - * @any_dirty: Optional pointer to indicate whether any entry except the
>>> - *		  first one is dirty.
>>>     *
>>>     * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>>>     * pages of the same large folio in a single VMA and a single page table.
>>> @@ -242,28 +249,26 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>>     * must be limited by the caller so scanning cannot exceed a single VMA and
>>>     * a single page table.
>>>     *
>>> + * Depending on the FPB_MERGE_* flags, the pte stored at @ptentp will
>>> + * be modified.
>>> + *
>>>     * This function will be inlined to optimize based on the input parameters;
>>>     * consider using folio_pte_batch() instead if applicable.
>>>     *
>>>     * Return: the number of table entries in the batch.
>>>     */
>>>    static inline unsigned int folio_pte_batch_ext(struct folio *folio,
>>> -		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
>>> -		bool *any_writable, bool *any_young, bool *any_dirty)
>>> +		struct vm_area_struct *vma, pte_t *ptep, pte_t *ptentp,
>>> +		unsigned int max_nr, fpb_t flags)
>>>    {
>>> +	bool any_writable = false, any_young = false, any_dirty = false;
>>> +	pte_t expected_pte, pte = *ptentp;
>>>    	unsigned int nr, cur_nr;
>>> -	pte_t expected_pte;
>>> -
>>> -	if (any_writable)
>>> -		*any_writable = false;
>>> -	if (any_young)
>>> -		*any_young = false;
>>> -	if (any_dirty)
>>> -		*any_dirty = false;
>>>
>>>    	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>>    	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
>>>    	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
>>> +	VM_WARN_ON(virt_addr_valid(ptentp) && PageTable(virt_to_page(ptentp)));
>>
>> Why not just VM_WARN_ON(!pte_same(*ptep, *ptentp)) ?
>>
>> ptep points to the first page table entry and ptentp points to the copy of it.
>> I assume ptep should point to a valid page table entry to begin with.
> 
> That would also work; it would only miss the case where someone accidentally
> flipped the two pointers :)

... or pass the same pointer twice.
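
Side by side, the two checks discussed here look roughly like this
(check_ptentp() is just a made-up wrapper for illustration):

	static inline void check_ptentp(pte_t *ptep, pte_t *ptentp)
	{
		/* As in the patch: @ptentp must not point into a page table. */
		VM_WARN_ON(virt_addr_valid(ptentp) &&
			   PageTable(virt_to_page(ptentp)));

		/*
		 * The suggested alternative: the copy must match the entry.
		 * This would not fire if the two pointers were swapped or if
		 * the same pointer were passed twice, which is what the check
		 * above is meant to catch.
		 */
		VM_WARN_ON(!pte_same(*ptep, *ptentp));
	}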

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements
  2025-07-02  9:08           ` Lorenzo Stoakes
@ 2025-07-02  9:11             ` David Hildenbrand
  0 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-07-02  9:11 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Oscar Salvador, linux-kernel, linux-mm, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On 02.07.25 11:08, Lorenzo Stoakes wrote:
> On Wed, Jul 02, 2025 at 11:00:48AM +0200, David Hildenbrand wrote:
>> On 02.07.25 10:51, Lorenzo Stoakes wrote:
>>> On Wed, Jul 02, 2025 at 10:48:20AM +0200, David Hildenbrand wrote:
>>>> On 02.07.25 10:42, Oscar Salvador wrote:
>>>>> On Fri, Jun 27, 2025 at 01:55:08PM +0200, David Hildenbrand wrote:
>>>>>> Let's clean up a bit:
>>>>>>
>>>>>> (1) No need for start_ptep vs. ptep anymore, we can simply use ptep
>>>>>>
>>>>>> (2) Let's switch to "unsigned int" for everything
>>>>>>
>>>>>> (3) We can simplify the code by leaving the pte unchanged after the
>>>>>>        pte_same() check.
>>>>>>
>>>>>> (4) Clarify that we should never exceed a single VMA; it indicates a
>>>>>>        problem in the caller.
>>>>>>
>>>>>> No functional change intended.
>>>>>>
>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>
>>>>> Hi David :-),
>>>>>
>>>>> I have to confess that I fell into the same trap as Lorenzo wrt.
>>>>> __pte_batch_clear_ignored changing the pte value.
>>>>> So I'm not sure if it would be nice to place a little comment in
>>>>> __pte_batch_clear_ignored clarifying that pte's value remains unchanged?
>>>>
>>>> I mean, that's how all our pte modification functions work, really? :)
>>>>
>>>> Thanks!
>>>
>>> I mean, it might be that me and Oscar are similarly 'challenged' in this
>>> respect :P (high 5 Oscar!) but I think the issue here is that it's sort of
>>> a compounded use, and in fact some functions do modify stuff, which is why
>>> we end up with all the ptep ptent etc. fun.
>>>
>>> Up to you re: comment, but I think maybe in cases where it's a really
>>> compounded set of stuff it's potentially useful.
>>>
>>> But obviously we still do do this all over the place elsewhere with no
>>> comment...
>>
>> Well, if you are passing in a *value* and not a pointer to a function,
>> you would not expect that *value* to change? :)
>>
>> Yes, once we pass pointers it's different. Or when we're using weird macros.
>>
>> Adding a comment that a function will not modify a value that is ...
>> passed-by-value? Maybe it's just me that doesn't get why that should be
>> particularly helpful :)
> 
> I think the issue is that we've previously passed around 'pte' as both a value
> and a pointer (and, of course, via macros...), so that's often the cause of the
> confusion.
> 
> This is why I really am a fan of us consistently saying ptep when passing a
> pointer.

100%: pte for pointers is absolutely nasty.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
  2025-07-02  9:07       ` Oscar Salvador
@ 2025-07-02  9:11         ` David Hildenbrand
  0 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2025-07-02  9:11 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo

On 02.07.25 11:07, Oscar Salvador wrote:
> On Wed, Jul 02, 2025 at 11:05:17AM +0200, David Hildenbrand wrote:
>> Not necessarily inlined into the body (there might still be a function call,
>> depending on what the compiler decides), but inlined into the object file
>> and optimized by propagating constants.
> 
> I see.
> 
>>> With this change, a single function, folio_pte_batch(), identical to
>>> folio_pte_batch_ext() but without the runtime checks for those arguments,
>>> will be created, and so its users won't have it inlined in their body?
>>
>> Right. We have a single folio_pte_batch() that is optimized by propagating
>> all constants. Instead of having one per object file, we have a single
>> shared one.
> 
> Alright, clear to me now, thanks for clarifying ;-)!

Will clarify that in the patch description, thanks!
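
For reference, after the full series the shared variant is essentially a thin,
non-inlined wrapper; roughly (a simplified sketch, the real definition lives in
mm/util.c):

	unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
			unsigned int max_nr)
	{
		return folio_pte_batch_ext(folio, NULL, ptep, &pte, max_nr, 0);
	}

Because the wrapper passes constant flags, the compiler can still propagate
them into the inlined folio_pte_batch_ext() body, but only one out-of-line copy
exists kernel-wide instead of one per object file.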

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2025-07-02  9:11 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-27 11:55 [PATCH v1 0/4] mm: folio_pte_batch() improvements David Hildenbrand
2025-06-27 11:55 ` [PATCH v1 1/4] mm: convert FPB_IGNORE_* into FPB_HONOR_* David Hildenbrand
2025-06-27 13:40   ` Lance Yang
2025-06-27 16:28   ` Lorenzo Stoakes
2025-06-27 16:30     ` David Hildenbrand
2025-06-27 16:33       ` Lorenzo Stoakes
2025-06-29  8:59         ` Mike Rapoport
2025-06-30 13:47           ` David Hildenbrand
2025-06-28  3:37   ` Dev Jain
2025-06-28 21:00     ` David Hildenbrand
2025-06-30  3:34       ` Dev Jain
2025-06-30  9:04         ` Ryan Roberts
2025-06-30  9:08           ` David Hildenbrand
2025-06-30  9:18             ` Ryan Roberts
2025-06-30  9:24               ` David Hildenbrand
2025-06-30 10:57                 ` Ryan Roberts
2025-06-30 11:01                   ` David Hildenbrand
2025-06-30 14:35   ` Zi Yan
2025-07-02  8:31   ` Oscar Salvador
2025-06-27 11:55 ` [PATCH v1 2/4] mm: smaller folio_pte_batch() improvements David Hildenbrand
2025-06-27 13:58   ` Lance Yang
2025-06-27 16:51   ` Lorenzo Stoakes
2025-06-27 17:02     ` David Hildenbrand
2025-06-27 18:39       ` Lorenzo Stoakes
2025-06-30 17:40   ` Zi Yan
2025-07-02  8:42   ` Oscar Salvador
2025-07-02  8:48     ` David Hildenbrand
2025-07-02  8:51       ` Lorenzo Stoakes
2025-07-02  9:00         ` David Hildenbrand
2025-07-02  9:08           ` Lorenzo Stoakes
2025-07-02  9:11             ` David Hildenbrand
2025-06-27 11:55 ` [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext() David Hildenbrand
2025-06-27 14:19   ` Lance Yang
2025-06-27 15:09     ` David Hildenbrand
2025-06-27 15:45       ` Lance Yang
2025-06-27 18:48   ` Lorenzo Stoakes
2025-06-30  9:19     ` David Hildenbrand
2025-06-30 10:41       ` Lorenzo Stoakes
2025-06-30 10:54         ` David Hildenbrand
2025-06-30 17:45   ` Zi Yan
2025-07-02  9:02   ` Oscar Salvador
2025-07-02  9:05     ` David Hildenbrand
2025-07-02  9:07       ` Oscar Salvador
2025-07-02  9:11         ` David Hildenbrand
2025-07-02  9:09   ` Oscar Salvador
2025-06-27 11:55 ` [PATCH v1 4/4] mm: remove boolean output parameters from folio_pte_batch_ext() David Hildenbrand
2025-06-27 14:34   ` Lance Yang
2025-06-27 15:11     ` David Hildenbrand
2025-06-27 15:40       ` Lance Yang
2025-06-27 19:04   ` Lorenzo Stoakes
2025-06-30  9:32     ` David Hildenbrand
2025-06-30 11:08       ` Lorenzo Stoakes
2025-06-30 11:16         ` David Hildenbrand
2025-06-30 11:18           ` Lorenzo Stoakes
2025-06-30 11:21             ` David Hildenbrand
2025-06-30 17:59   ` Zi Yan
2025-07-02  9:08     ` David Hildenbrand
2025-07-02  9:09       ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).