* [PATCH v2 0/4] mm: folio_pte_batch() improvements
@ 2025-07-02 10:49 David Hildenbrand
  2025-07-02 10:49 ` [PATCH v2 1/4] mm: convert FPB_IGNORE_* into FPB_RESPECT_* David Hildenbrand
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-07-02 10:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang, Oscar Salvador

Ever since we added folio_pte_batch() for fork() + munmap() purposes,
many more users have appeared (and more are being proposed), and more
functionality has been added.

Most of the users only need basic functionality, and could benefit from
a non-inlined version.

So let's clean up folio_pte_batch() and split it into a basic
folio_pte_batch() (no flags) and a more advanced folio_pte_batch_flags().
Using either variant will now look much cleaner.
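
For illustration, this is roughly how the two variants end up being called
once the series is applied (signatures as introduced in patches #3 and #4;
the surrounding variables are only a sketch):

	pte_t ptent = ptep_get(ptep);	/* stack copy of the first entry */
	unsigned int nr;

	/* Common case: no flags, one shared non-inlined helper. */
	nr = folio_pte_batch(folio, ptep, ptent, max_nr);

	/* Flags variant: inlined, so constant flags get optimized out. */
	nr = folio_pte_batch_flags(folio, vma, ptep, &ptent, max_nr,
				   FPB_MERGE_WRITE);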

This series will likely conflict with some changes in some
(old+new) folio_pte_batch() users, but conflicts should be trivial to
resolve.

Tested on x86-64. Cross-compile tested.

v1 -> v2:
* Use FPB_RESPECT_* instead of FPB_HONOR_*
* Use folio_pte_batch_flags() instead of folio_pte_batch_ext()
* Improvements to patch descriptions + comments/doc

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>

David Hildenbrand (4):
  mm: convert FPB_IGNORE_* into FPB_RESPECT_*
  mm: smaller folio_pte_batch() improvements
  mm: split folio_pte_batch() into folio_pte_batch() and
    folio_pte_batch_flags()
  mm: remove boolean output parameters from folio_pte_batch_ext()

 mm/internal.h  | 117 ++++++++++++++++++++++++++++---------------------
 mm/madvise.c   |  27 +++---------
 mm/memory.c    |  21 +++------
 mm/mempolicy.c |   5 +--
 mm/mlock.c     |   4 +-
 mm/mremap.c    |   4 +-
 mm/rmap.c      |   4 +-
 mm/util.c      |  29 ++++++++++++
 8 files changed, 112 insertions(+), 99 deletions(-)


base-commit: 01136079697c6686e7198bf1797c004767ecf6f1
-- 
2.49.0




* [PATCH v2 1/4] mm: convert FPB_IGNORE_* into FPB_RESPECT_*
  2025-07-02 10:49 [PATCH v2 0/4] mm: folio_pte_batch() improvements David Hildenbrand
@ 2025-07-02 10:49 ` David Hildenbrand
  2025-07-03 10:05   ` Dev Jain
  2025-07-02 10:49 ` [PATCH v2 2/4] mm: smaller folio_pte_batch() improvements David Hildenbrand
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-07-02 10:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang, Oscar Salvador, Lance Yang

Respecting these PTE bits is the exception, so let's invert the meaning.

With this change, most callers don't have to pass any flags. This is
a preparation for splitting folio_pte_batch() into a non-inlined
variant that doesn't consume any flags.

Long-term, we probably want folio_pte_batch() to ignore the most common
PTE bits (e.g., write/dirty/young/soft-dirty) that are not relevant for
most page table walkers; uffd-wp and protnone might be bits to consider in
the future. Only walkers that care about them can opt in to respect them.

No functional change intended.
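
As a sketch of the call-site effect (taken from the hunks below), the
common "ignore both" case changes from

	nr = folio_pte_batch(folio, addr, ptep, pte, max_nr,
			     FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY,
			     NULL, NULL, NULL);

to simply

	nr = folio_pte_batch(folio, addr, ptep, pte, max_nr, 0,
			     NULL, NULL, NULL);

while only fork() now opts in via FPB_RESPECT_DIRTY / FPB_RESPECT_SOFT_DIRTY.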

Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h  | 16 ++++++++--------
 mm/madvise.c   |  3 +--
 mm/memory.c    | 11 +++++------
 mm/mempolicy.c |  4 +---
 mm/mlock.c     |  3 +--
 mm/mremap.c    |  3 +--
 mm/rmap.c      |  3 +--
 7 files changed, 18 insertions(+), 25 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index e84217e27778d..170d55b6851ff 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -202,17 +202,17 @@ static inline void vma_close(struct vm_area_struct *vma)
 /* Flags for folio_pte_batch(). */
 typedef int __bitwise fpb_t;
 
-/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
-#define FPB_IGNORE_DIRTY		((__force fpb_t)BIT(0))
+/* Compare PTEs respecting the dirty bit. */
+#define FPB_RESPECT_DIRTY		((__force fpb_t)BIT(0))
 
-/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
-#define FPB_IGNORE_SOFT_DIRTY		((__force fpb_t)BIT(1))
+/* Compare PTEs respecting the soft-dirty bit. */
+#define FPB_RESPECT_SOFT_DIRTY		((__force fpb_t)BIT(1))
 
 static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
 {
-	if (flags & FPB_IGNORE_DIRTY)
+	if (!(flags & FPB_RESPECT_DIRTY))
 		pte = pte_mkclean(pte);
-	if (likely(flags & FPB_IGNORE_SOFT_DIRTY))
+	if (likely(!(flags & FPB_RESPECT_SOFT_DIRTY)))
 		pte = pte_clear_soft_dirty(pte);
 	return pte_wrprotect(pte_mkold(pte));
 }
@@ -236,8 +236,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  * pages of the same large folio.
  *
  * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
- * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
- * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
+ * the accessed bit, writable bit, dirty bit (unless FPB_RESPECT_DIRTY is set)
+ * and soft-dirty bit (unless FPB_RESPECT_SOFT_DIRTY is set).
  *
  * start_ptep must map any page of the folio. max_nr must be at least one and
  * must be limited by the caller so scanning cannot exceed a single page table.
diff --git a/mm/madvise.c b/mm/madvise.c
index e61e32b2cd91f..661bb743d2216 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -347,10 +347,9 @@ static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
 					  pte_t pte, bool *any_young,
 					  bool *any_dirty)
 {
-	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	int max_nr = (end - addr) / PAGE_SIZE;
 
-	return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
+	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
 			       any_young, any_dirty);
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 0f9b32a20e5b7..0a47583ca9937 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -990,10 +990,10 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	 * by keeping the batching logic separate.
 	 */
 	if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
-		if (src_vma->vm_flags & VM_SHARED)
-			flags |= FPB_IGNORE_DIRTY;
-		if (!vma_soft_dirty_enabled(src_vma))
-			flags |= FPB_IGNORE_SOFT_DIRTY;
+		if (!(src_vma->vm_flags & VM_SHARED))
+			flags |= FPB_RESPECT_DIRTY;
+		if (vma_soft_dirty_enabled(src_vma))
+			flags |= FPB_RESPECT_SOFT_DIRTY;
 
 		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
 				     &any_writable, NULL, NULL);
@@ -1535,7 +1535,6 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 		struct zap_details *details, int *rss, bool *force_flush,
 		bool *force_break, bool *any_skipped)
 {
-	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	struct mm_struct *mm = tlb->mm;
 	struct folio *folio;
 	struct page *page;
@@ -1565,7 +1564,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 	 * by keeping the batching logic separate.
 	 */
 	if (unlikely(folio_test_large(folio) && max_nr != 1)) {
-		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
+		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, 0,
 				     NULL, NULL, NULL);
 
 		zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1ff7b2174eb77..2a25eedc3b1c0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -675,7 +675,6 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
 static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 			unsigned long end, struct mm_walk *walk)
 {
-	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	struct vm_area_struct *vma = walk->vma;
 	struct folio *folio;
 	struct queue_pages *qp = walk->private;
@@ -713,8 +712,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		if (folio_test_large(folio) && max_nr != 1)
 			nr = folio_pte_batch(folio, addr, pte, ptent,
-					     max_nr, fpb_flags,
-					     NULL, NULL, NULL);
+					     max_nr, 0, NULL, NULL, NULL);
 		/*
 		 * vm_normal_folio() filters out zero pages, but there might
 		 * still be reserved folios to skip, perhaps in a VDSO.
diff --git a/mm/mlock.c b/mm/mlock.c
index 3cb72b579ffd3..2238cdc5eb1c1 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -307,14 +307,13 @@ void munlock_folio(struct folio *folio)
 static inline unsigned int folio_mlock_step(struct folio *folio,
 		pte_t *pte, unsigned long addr, unsigned long end)
 {
-	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	unsigned int count = (end - addr) >> PAGE_SHIFT;
 	pte_t ptent = ptep_get(pte);
 
 	if (!folio_test_large(folio))
 		return 1;
 
-	return folio_pte_batch(folio, addr, pte, ptent, count, fpb_flags, NULL,
+	return folio_pte_batch(folio, addr, pte, ptent, count, 0, NULL,
 			       NULL, NULL);
 }
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 36585041c760d..d4d3ffc931502 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -173,7 +173,6 @@ static pte_t move_soft_dirty_pte(pte_t pte)
 static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
 		pte_t *ptep, pte_t pte, int max_nr)
 {
-	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	struct folio *folio;
 
 	if (max_nr == 1)
@@ -183,7 +182,7 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
 	if (!folio || !folio_test_large(folio))
 		return 1;
 
-	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
+	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
 			       NULL, NULL);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 34311f654d0c2..98ed3e49e8bbe 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1849,7 +1849,6 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 			struct page_vma_mapped_walk *pvmw,
 			enum ttu_flags flags, pte_t pte)
 {
-	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	unsigned long end_addr, addr = pvmw->address;
 	struct vm_area_struct *vma = pvmw->vma;
 	unsigned int max_nr;
@@ -1869,7 +1868,7 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 	if (pte_unused(pte))
 		return 1;
 
-	return folio_pte_batch(folio, addr, pvmw->pte, pte, max_nr, fpb_flags,
+	return folio_pte_batch(folio, addr, pvmw->pte, pte, max_nr, 0,
 			       NULL, NULL, NULL);
 }
 
-- 
2.49.0




* [PATCH v2 2/4] mm: smaller folio_pte_batch() improvements
  2025-07-02 10:49 [PATCH v2 0/4] mm: folio_pte_batch() improvements David Hildenbrand
  2025-07-02 10:49 ` [PATCH v2 1/4] mm: convert FPB_IGNORE_* into FPB_RESPECT_* David Hildenbrand
@ 2025-07-02 10:49 ` David Hildenbrand
  2025-07-02 14:24   ` Zi Yan
  2025-07-03 10:18   ` Dev Jain
  2025-07-02 10:49 ` [PATCH v2 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_flags() David Hildenbrand
  2025-07-02 10:49 ` [PATCH v2 4/4] mm: remove boolean output parameters from folio_pte_batch_ext() David Hildenbrand
  3 siblings, 2 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-07-02 10:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang, Oscar Salvador, Lance Yang

Let's clean up a bit:

(1) No need for start_ptep vs. ptep anymore, we can simply use ptep.

(2) Let's switch to "unsigned int" for everything. Negative values do
    not make sense.

(3) We can simplify the code by leaving the pte unchanged after the
    pte_same() check.

(4) Clarify that we should never exceed a single VMA; it indicates a
    problem in the caller.

No functional change intended.
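
For example, cleanup (3) boils down to comparing a masked copy instead of
overwriting the PTE, so the any_* updates can read the unmodified entry
(sketch of the resulting loop body, as in the hunk below):

	pte = ptep_get(ptep);

	/* Compare a masked copy; pte itself stays unmodified. */
	if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte))
		break;

	if (any_writable)
		*any_writable |= pte_write(pte);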

Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h | 37 +++++++++++++++----------------------
 1 file changed, 15 insertions(+), 22 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 170d55b6851ff..dba1346ded972 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -221,7 +221,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  * folio_pte_batch - detect a PTE batch for a large folio
  * @folio: The large folio to detect a PTE batch for.
  * @addr: The user virtual address the first page is mapped at.
- * @start_ptep: Page table pointer for the first entry.
+ * @ptep: Page table pointer for the first entry.
  * @pte: Page table entry for the first page.
  * @max_nr: The maximum number of table entries to consider.
  * @flags: Flags to modify the PTE batch semantics.
@@ -233,24 +233,24 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  *		  first one is dirty.
  *
  * Detect a PTE batch: consecutive (present) PTEs that map consecutive
- * pages of the same large folio.
+ * pages of the same large folio in a single VMA and a single page table.
  *
  * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
  * the accessed bit, writable bit, dirty bit (unless FPB_RESPECT_DIRTY is set)
  * and soft-dirty bit (unless FPB_RESPECT_SOFT_DIRTY is set).
  *
- * start_ptep must map any page of the folio. max_nr must be at least one and
- * must be limited by the caller so scanning cannot exceed a single page table.
+ * @ptep must map any page of the folio. max_nr must be at least one and
+ * must be limited by the caller so scanning cannot exceed a single VMA and
+ * a single page table.
  *
  * Return: the number of table entries in the batch.
  */
-static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
-		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
+static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
+		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
 		bool *any_writable, bool *any_young, bool *any_dirty)
 {
-	pte_t expected_pte, *ptep;
-	bool writable, young, dirty;
-	int nr, cur_nr;
+	unsigned int nr, cur_nr;
+	pte_t expected_pte;
 
 	if (any_writable)
 		*any_writable = false;
@@ -267,29 +267,22 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 	max_nr = min_t(unsigned long, max_nr,
 		       folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
 
-	nr = pte_batch_hint(start_ptep, pte);
+	nr = pte_batch_hint(ptep, pte);
 	expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
-	ptep = start_ptep + nr;
+	ptep = ptep + nr;
 
 	while (nr < max_nr) {
 		pte = ptep_get(ptep);
-		if (any_writable)
-			writable = !!pte_write(pte);
-		if (any_young)
-			young = !!pte_young(pte);
-		if (any_dirty)
-			dirty = !!pte_dirty(pte);
-		pte = __pte_batch_clear_ignored(pte, flags);
 
-		if (!pte_same(pte, expected_pte))
+		if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte))
 			break;
 
 		if (any_writable)
-			*any_writable |= writable;
+			*any_writable |= pte_write(pte);
 		if (any_young)
-			*any_young |= young;
+			*any_young |= pte_young(pte);
 		if (any_dirty)
-			*any_dirty |= dirty;
+			*any_dirty |= pte_dirty(pte);
 
 		cur_nr = pte_batch_hint(ptep, pte);
 		expected_pte = pte_advance_pfn(expected_pte, cur_nr);
-- 
2.49.0




* [PATCH v2 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_flags()
  2025-07-02 10:49 [PATCH v2 0/4] mm: folio_pte_batch() improvements David Hildenbrand
  2025-07-02 10:49 ` [PATCH v2 1/4] mm: convert FPB_IGNORE_* into FPB_RESPECT_* David Hildenbrand
  2025-07-02 10:49 ` [PATCH v2 2/4] mm: smaller folio_pte_batch() improvements David Hildenbrand
@ 2025-07-02 10:49 ` David Hildenbrand
  2025-07-02 14:25   ` Zi Yan
  2025-07-03 10:45   ` Dev Jain
  2025-07-02 10:49 ` [PATCH v2 4/4] mm: remove boolean output parameters from folio_pte_batch_ext() David Hildenbrand
  3 siblings, 2 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-07-02 10:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang, Oscar Salvador

Many users (including upcoming ones) don't really need the flags etc,
and can live with the possible overhead of a function call.

So let's provide a basic, non-inlined folio_pte_batch(), to avoid code
bloat while still providing a variant that optimizes out all flag
checks at runtime. folio_pte_batch_flags() will get inlined into
folio_pte_batch(), optimizing out any conditionals that depend on input
flags.

folio_pte_batch() will behave like folio_pte_batch_flags() when no
flags are specified. It's okay to add new users of
folio_pte_batch_flags(), but using folio_pte_batch() if applicable is
preferred.

So, before this change, folio_pte_batch() was inlined into each C file and
optimized by propagating constants within the resulting object file.

With this change, we now also have a folio_pte_batch() that is
optimized by propagating all constants. But instead of having one instance
per object file, we have a single shared one.

In zap_present_ptes(), where we care about performance, the compiler
already seems to generate a call to a common inlined folio_pte_batch()
variant, shared with fork() code. So calling the new non-inlined variant
should not make a difference.

While at it, drop the "addr" parameter that is unused.
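
The resulting pattern (see the mm/util.c hunk below): folio_pte_batch_flags()
stays an inline function in mm/internal.h, and the flag-less wrapper becomes
the single out-of-line instance that all basic users share:

	unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
			unsigned int max_nr)
	{
		return folio_pte_batch_flags(folio, ptep, pte, max_nr, 0,
					     NULL, NULL, NULL);
	}

Because the flags are the compile-time constant 0 in this one translation
unit, the compiler drops all flag-dependent code from the shared instance.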

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/linux-mm/20250503182858.5a02729fcffd6d4723afcfc2@linux-foundation.org/
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h  | 11 ++++++++---
 mm/madvise.c   |  4 ++--
 mm/memory.c    |  8 +++-----
 mm/mempolicy.c |  3 +--
 mm/mlock.c     |  3 +--
 mm/mremap.c    |  3 +--
 mm/rmap.c      |  3 +--
 mm/util.c      | 29 +++++++++++++++++++++++++++++
 8 files changed, 46 insertions(+), 18 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index dba1346ded972..6c92956ac4fd9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -218,9 +218,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
 }
 
 /**
- * folio_pte_batch - detect a PTE batch for a large folio
+ * folio_pte_batch_flags - detect a PTE batch for a large folio
  * @folio: The large folio to detect a PTE batch for.
- * @addr: The user virtual address the first page is mapped at.
  * @ptep: Page table pointer for the first entry.
  * @pte: Page table entry for the first page.
  * @max_nr: The maximum number of table entries to consider.
@@ -243,9 +242,12 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  * must be limited by the caller so scanning cannot exceed a single VMA and
  * a single page table.
  *
+ * This function will be inlined to optimize based on the input parameters;
+ * consider using folio_pte_batch() instead if applicable.
+ *
  * Return: the number of table entries in the batch.
  */
-static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
+static inline unsigned int folio_pte_batch_flags(struct folio *folio,
 		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
 		bool *any_writable, bool *any_young, bool *any_dirty)
 {
@@ -293,6 +295,9 @@ static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long ad
 	return min(nr, max_nr);
 }
 
+unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
+		unsigned int max_nr);
+
 /**
  * pte_move_swp_offset - Move the swap entry offset field of a swap pte
  *	 forward or backward by delta
diff --git a/mm/madvise.c b/mm/madvise.c
index 661bb743d2216..fe363a14daab3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -349,8 +349,8 @@ static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
 {
 	int max_nr = (end - addr) / PAGE_SIZE;
 
-	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
-			       any_young, any_dirty);
+	return folio_pte_batch_flags(folio, ptep, pte, max_nr, 0, NULL,
+				     any_young, any_dirty);
 }
 
 static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
diff --git a/mm/memory.c b/mm/memory.c
index 0a47583ca9937..26a82b82863b0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -995,8 +995,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		if (vma_soft_dirty_enabled(src_vma))
 			flags |= FPB_RESPECT_SOFT_DIRTY;
 
-		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
-				     &any_writable, NULL, NULL);
+		nr = folio_pte_batch_flags(folio, src_pte, pte, max_nr, flags,
+					   &any_writable, NULL, NULL);
 		folio_ref_add(folio, nr);
 		if (folio_test_anon(folio)) {
 			if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
@@ -1564,9 +1564,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 	 * by keeping the batching logic separate.
 	 */
 	if (unlikely(folio_test_large(folio) && max_nr != 1)) {
-		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, 0,
-				     NULL, NULL, NULL);
-
+		nr = folio_pte_batch(folio, pte, ptent, max_nr);
 		zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
 				       addr, details, rss, force_flush,
 				       force_break, any_skipped);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2a25eedc3b1c0..eb83cff7db8c3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -711,8 +711,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 		if (!folio || folio_is_zone_device(folio))
 			continue;
 		if (folio_test_large(folio) && max_nr != 1)
-			nr = folio_pte_batch(folio, addr, pte, ptent,
-					     max_nr, 0, NULL, NULL, NULL);
+			nr = folio_pte_batch(folio, pte, ptent, max_nr);
 		/*
 		 * vm_normal_folio() filters out zero pages, but there might
 		 * still be reserved folios to skip, perhaps in a VDSO.
diff --git a/mm/mlock.c b/mm/mlock.c
index 2238cdc5eb1c1..a1d93ad33c6db 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -313,8 +313,7 @@ static inline unsigned int folio_mlock_step(struct folio *folio,
 	if (!folio_test_large(folio))
 		return 1;
 
-	return folio_pte_batch(folio, addr, pte, ptent, count, 0, NULL,
-			       NULL, NULL);
+	return folio_pte_batch(folio, pte, ptent, count);
 }
 
 static inline bool allow_mlock_munlock(struct folio *folio,
diff --git a/mm/mremap.c b/mm/mremap.c
index d4d3ffc931502..1f5bebbb9c0cb 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -182,8 +182,7 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
 	if (!folio || !folio_test_large(folio))
 		return 1;
 
-	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
-			       NULL, NULL);
+	return folio_pte_batch(folio, ptep, pte, max_nr);
 }
 
 static int move_ptes(struct pagetable_move_control *pmc,
diff --git a/mm/rmap.c b/mm/rmap.c
index 98ed3e49e8bbe..a15939453c41a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1868,8 +1868,7 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 	if (pte_unused(pte))
 		return 1;
 
-	return folio_pte_batch(folio, addr, pvmw->pte, pte, max_nr, 0,
-			       NULL, NULL, NULL);
+	return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
 }
 
 /*
diff --git a/mm/util.c b/mm/util.c
index 0b270c43d7d12..cf41edceec7d2 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1171,3 +1171,32 @@ int compat_vma_mmap_prepare(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 EXPORT_SYMBOL(compat_vma_mmap_prepare);
+
+#ifdef CONFIG_MMU
+/**
+ * folio_pte_batch - detect a PTE batch for a large folio
+ * @folio: The large folio to detect a PTE batch for.
+ * @ptep: Page table pointer for the first entry.
+ * @pte: Page table entry for the first page.
+ * @max_nr: The maximum number of table entries to consider.
+ *
+ * This is a simplified variant of folio_pte_batch_flags().
+ *
+ * Detect a PTE batch: consecutive (present) PTEs that map consecutive
+ * pages of the same large folio in a single VMA and a single page table.
+ *
+ * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
+ * the accessed bit, writable bit, dirty bit and soft-dirty bit.
+ *
+ * ptep must map any page of the folio. max_nr must be at least one and
+ * must be limited by the caller so scanning cannot exceed a single VMA and
+ * a single page table.
+ *
+ * Return: the number of table entries in the batch.
+ */
+unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
+		unsigned int max_nr)
+{
+	return folio_pte_batch_flags(folio, ptep, pte, max_nr, 0, NULL, NULL, NULL);
+}
+#endif /* CONFIG_MMU */
-- 
2.49.0




* [PATCH v2 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-07-02 10:49 [PATCH v2 0/4] mm: folio_pte_batch() improvements David Hildenbrand
                   ` (2 preceding siblings ...)
  2025-07-02 10:49 ` [PATCH v2 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_flags() David Hildenbrand
@ 2025-07-02 10:49 ` David Hildenbrand
  2025-07-02 14:00   ` Oscar Salvador
                     ` (3 more replies)
  3 siblings, 4 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-07-02 10:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang, Oscar Salvador

Instead, let's just allow for specifying through flags whether we want
to have bits merged into the original PTE.

For the madvise() case, simplify by having only a single parameter for
merging young+dirty. For madvise_cold_or_pageout_pte_range(), merging the
dirty bit is not required, but also not harmful. After all, this code is not
performance critical enough to force every micro-optimization.

As we now have two pte_t * parameters, use PageTable() to make sure we
are actually given a pointer to a copy of the PTE, not a pointer into
an actual page table.
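
A sketch of the new calling convention (cf. the madvise/memory hunks below);
the key point is that @ptentp must point at a stack copy of the entry, never
into the page table itself:

	pte_t ptent = ptep_get(ptep);	/* stack copy */

	nr = folio_pte_batch_flags(folio, vma, ptep, &ptent, max_nr,
				   FPB_MERGE_YOUNG_DIRTY);
	/* ptent now has young/dirty set if any PTE of the batch had them. */

Passing a pointer into the page table (e.g., "ptep" itself) would trip the
new PageTable() sanity check.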

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h | 65 +++++++++++++++++++++++++++++++++------------------
 mm/madvise.c  | 26 ++++-----------------
 mm/memory.c   |  8 ++-----
 mm/util.c     |  2 +-
 4 files changed, 50 insertions(+), 51 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 6c92956ac4fd9..b7131bd3d1ad1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -208,6 +208,18 @@ typedef int __bitwise fpb_t;
 /* Compare PTEs respecting the soft-dirty bit. */
 #define FPB_RESPECT_SOFT_DIRTY		((__force fpb_t)BIT(1))
 
+/*
+ * Merge PTE write bits: if any PTE in the batch is writable, modify the
+ * PTE at @ptentp to be writable.
+ */
+#define FPB_MERGE_WRITE			((__force fpb_t)BIT(2))
+
+/*
+ * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty,
+ * modify the PTE at @ptentp to be young or dirty, respectively.
+ */
+#define FPB_MERGE_YOUNG_DIRTY		((__force fpb_t)BIT(3))
+
 static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
 {
 	if (!(flags & FPB_RESPECT_DIRTY))
@@ -220,16 +232,12 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
 /**
  * folio_pte_batch_flags - detect a PTE batch for a large folio
  * @folio: The large folio to detect a PTE batch for.
+ * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
  * @ptep: Page table pointer for the first entry.
- * @pte: Page table entry for the first page.
+ * @ptentp: Pointer to a COPY of the first page table entry whose flags this
+ *	    function updates based on @flags if appropriate.
  * @max_nr: The maximum number of table entries to consider.
  * @flags: Flags to modify the PTE batch semantics.
- * @any_writable: Optional pointer to indicate whether any entry except the
- *		  first one is writable.
- * @any_young: Optional pointer to indicate whether any entry except the
- *		  first one is young.
- * @any_dirty: Optional pointer to indicate whether any entry except the
- *		  first one is dirty.
  *
  * Detect a PTE batch: consecutive (present) PTEs that map consecutive
  * pages of the same large folio in a single VMA and a single page table.
@@ -242,28 +250,32 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  * must be limited by the caller so scanning cannot exceed a single VMA and
  * a single page table.
  *
+ * Depending on the FPB_MERGE_* flags, the pte stored at @ptentp will
+ * be updated: it's crucial that a pointer to a COPY of the first
+ * page table entry, obtained through ptep_get(), is provided as @ptentp.
+ *
  * This function will be inlined to optimize based on the input parameters;
  * consider using folio_pte_batch() instead if applicable.
  *
  * Return: the number of table entries in the batch.
  */
 static inline unsigned int folio_pte_batch_flags(struct folio *folio,
-		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
-		bool *any_writable, bool *any_young, bool *any_dirty)
+		struct vm_area_struct *vma, pte_t *ptep, pte_t *ptentp,
+		unsigned int max_nr, fpb_t flags)
 {
+	bool any_writable = false, any_young = false, any_dirty = false;
+	pte_t expected_pte, pte = *ptentp;
 	unsigned int nr, cur_nr;
-	pte_t expected_pte;
-
-	if (any_writable)
-		*any_writable = false;
-	if (any_young)
-		*any_young = false;
-	if (any_dirty)
-		*any_dirty = false;
 
 	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
 	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
 	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
+	/*
+	 * Ensure this is a pointer to a copy not a pointer into a page table.
+	 * If this is a stack value, it won't be a valid virtual address, but
+	 * that's fine because it also cannot be pointing into the page table.
+	 */
+	VM_WARN_ON(virt_addr_valid(ptentp) && PageTable(virt_to_page(ptentp)));
 
 	/* Limit max_nr to the actual remaining PFNs in the folio we could batch. */
 	max_nr = min_t(unsigned long, max_nr,
@@ -279,12 +291,12 @@ static inline unsigned int folio_pte_batch_flags(struct folio *folio,
 		if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte))
 			break;
 
-		if (any_writable)
-			*any_writable |= pte_write(pte);
-		if (any_young)
-			*any_young |= pte_young(pte);
-		if (any_dirty)
-			*any_dirty |= pte_dirty(pte);
+		if (flags & FPB_MERGE_WRITE)
+			any_writable |= pte_write(pte);
+		if (flags & FPB_MERGE_YOUNG_DIRTY) {
+			any_young |= pte_young(pte);
+			any_dirty |= pte_dirty(pte);
+		}
 
 		cur_nr = pte_batch_hint(ptep, pte);
 		expected_pte = pte_advance_pfn(expected_pte, cur_nr);
@@ -292,6 +304,13 @@ static inline unsigned int folio_pte_batch_flags(struct folio *folio,
 		nr += cur_nr;
 	}
 
+	if (any_writable)
+		*ptentp = pte_mkwrite(*ptentp, vma);
+	if (any_young)
+		*ptentp = pte_mkyoung(*ptentp);
+	if (any_dirty)
+		*ptentp = pte_mkdirty(*ptentp);
+
 	return min(nr, max_nr);
 }
 
diff --git a/mm/madvise.c b/mm/madvise.c
index fe363a14daab3..9de9b7c797c63 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -344,13 +344,12 @@ static inline bool can_do_file_pageout(struct vm_area_struct *vma)
 
 static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
 					  struct folio *folio, pte_t *ptep,
-					  pte_t pte, bool *any_young,
-					  bool *any_dirty)
+					  pte_t *ptentp)
 {
 	int max_nr = (end - addr) / PAGE_SIZE;
 
-	return folio_pte_batch_flags(folio, ptep, pte, max_nr, 0, NULL,
-				     any_young, any_dirty);
+	return folio_pte_batch_flags(folio, NULL, ptep, ptentp, max_nr,
+				     FPB_MERGE_YOUNG_DIRTY);
 }
 
 static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
@@ -488,13 +487,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		 * next pte in the range.
 		 */
 		if (folio_test_large(folio)) {
-			bool any_young;
-
-			nr = madvise_folio_pte_batch(addr, end, folio, pte,
-						     ptent, &any_young, NULL);
-			if (any_young)
-				ptent = pte_mkyoung(ptent);
-
+			nr = madvise_folio_pte_batch(addr, end, folio, pte, &ptent);
 			if (nr < folio_nr_pages(folio)) {
 				int err;
 
@@ -724,11 +717,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		 * next pte in the range.
 		 */
 		if (folio_test_large(folio)) {
-			bool any_young, any_dirty;
-
-			nr = madvise_folio_pte_batch(addr, end, folio, pte,
-						     ptent, &any_young, &any_dirty);
-
+			nr = madvise_folio_pte_batch(addr, end, folio, pte, &ptent);
 			if (nr < folio_nr_pages(folio)) {
 				int err;
 
@@ -753,11 +742,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 					nr = 0;
 				continue;
 			}
-
-			if (any_young)
-				ptent = pte_mkyoung(ptent);
-			if (any_dirty)
-				ptent = pte_mkdirty(ptent);
 		}
 
 		if (folio_test_swapcache(folio) || folio_test_dirty(folio)) {
diff --git a/mm/memory.c b/mm/memory.c
index 26a82b82863b0..0269edb520987 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -972,10 +972,9 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		 pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
 		 int max_nr, int *rss, struct folio **prealloc)
 {
+	fpb_t flags = FPB_MERGE_WRITE;
 	struct page *page;
 	struct folio *folio;
-	bool any_writable;
-	fpb_t flags = 0;
 	int err, nr;
 
 	page = vm_normal_page(src_vma, addr, pte);
@@ -995,8 +994,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		if (vma_soft_dirty_enabled(src_vma))
 			flags |= FPB_RESPECT_SOFT_DIRTY;
 
-		nr = folio_pte_batch_flags(folio, src_pte, pte, max_nr, flags,
-					   &any_writable, NULL, NULL);
+		nr = folio_pte_batch_flags(folio, src_vma, src_pte, &pte, max_nr, flags);
 		folio_ref_add(folio, nr);
 		if (folio_test_anon(folio)) {
 			if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
@@ -1010,8 +1008,6 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 			folio_dup_file_rmap_ptes(folio, page, nr, dst_vma);
 			rss[mm_counter_file(folio)] += nr;
 		}
-		if (any_writable)
-			pte = pte_mkwrite(pte, src_vma);
 		__copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte,
 				    addr, nr);
 		return nr;
diff --git a/mm/util.c b/mm/util.c
index cf41edceec7d2..ce826ca82a11d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1197,6 +1197,6 @@ EXPORT_SYMBOL(compat_vma_mmap_prepare);
 unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
 		unsigned int max_nr)
 {
-	return folio_pte_batch_flags(folio, ptep, pte, max_nr, 0, NULL, NULL, NULL);
+	return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr, 0);
 }
 #endif /* CONFIG_MMU */
-- 
2.49.0




* Re: [PATCH v2 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-07-02 10:49 ` [PATCH v2 4/4] mm: remove boolean output parameters from folio_pte_batch_ext() David Hildenbrand
@ 2025-07-02 14:00   ` Oscar Salvador
  2025-07-02 14:40     ` David Hildenbrand
  2025-07-03  9:16   ` Oscar Salvador
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Oscar Salvador @ 2025-07-02 14:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang

On Wed, Jul 02, 2025 at 12:49:26PM +0200, David Hildenbrand wrote:
> Instead, let's just allow for specifying through flags whether we want
> to have bits merged into the original PTE.
> 
> For the madvise() case, simplify by having only a single parameter for
> merging young+dirty. For madvise_cold_or_pageout_pte_range(), merging the
> dirty bit is not required, but also not harmful. After all, this code is not
> performance critical enough to force every micro-optimization.

It seems a bit odd to me to have the dirty and young bits being treated as
"one". You mention that this is done because the only user of it doesn't
really care about dirty vs. non-dirty and it's not harmful either way, so
micro-optimizing this isn't worth it at the moment.

But what if we grow another user which wants to make this distinction, and
where dirty vs. non-dirty / young vs. non-young matters?
Wouldn't it be better to have them separated from the start?

And I'm not talking about micro-optimizing, because it's clear that it
doesn't matter, but about clarity.
It seems a lot more organic/natural/obvious to me.



-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v2 2/4] mm: smaller folio_pte_batch() improvements
  2025-07-02 10:49 ` [PATCH v2 2/4] mm: smaller folio_pte_batch() improvements David Hildenbrand
@ 2025-07-02 14:24   ` Zi Yan
  2025-07-03 10:18   ` Dev Jain
  1 sibling, 0 replies; 17+ messages in thread
From: Zi Yan @ 2025-07-02 14:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang, Oscar Salvador, Lance Yang

On 2 Jul 2025, at 6:49, David Hildenbrand wrote:

> Let's clean up a bit:
>
> (1) No need for start_ptep vs. ptep anymore, we can simply use ptep.
>
> (2) Let's switch to "unsigned int" for everything. Negative values do
>     not make sense.
>
> (3) We can simplify the code by leaving the pte unchanged after the
>     pte_same() check.
>
> (4) Clarify that we should never exceed a single VMA; it indicates a
>     problem in the caller.
>
> No functional change intended.
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/internal.h | 37 +++++++++++++++----------------------
>  1 file changed, 15 insertions(+), 22 deletions(-)
>
Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi



* Re: [PATCH v2 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_flags()
  2025-07-02 10:49 ` [PATCH v2 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_flags() David Hildenbrand
@ 2025-07-02 14:25   ` Zi Yan
  2025-07-03 10:45   ` Dev Jain
  1 sibling, 0 replies; 17+ messages in thread
From: Zi Yan @ 2025-07-02 14:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang, Oscar Salvador

On 2 Jul 2025, at 6:49, David Hildenbrand wrote:

> Many users (including upcoming ones) don't really need the flags etc,
> and can live with the possible overhead of a function call.
>
> So let's provide a basic, non-inlined folio_pte_batch(), to avoid code
> bloat while still providing a variant that optimizes out all flag
> checks at runtime. folio_pte_batch_flags() will get inlined into
> folio_pte_batch(), optimizing out any conditionals that depend on input
> flags.
>
> folio_pte_batch() will behave like folio_pte_batch_flags() when no
> flags are specified. It's okay to add new users of
> folio_pte_batch_flags(), but using folio_pte_batch() if applicable is
> preferred.
>
> So, before this change, folio_pte_batch() was inlined into each C file and
> optimized by propagating constants within the resulting object file.
>
> With this change, we now also have a folio_pte_batch() that is
> optimized by propagating all constants. But instead of having one instance
> per object file, we have a single shared one.
>
> In zap_present_ptes(), where we care about performance, the compiler
> already seems to generate a call to a common inlined folio_pte_batch()
> variant, shared with fork() code. So calling the new non-inlined variant
> should not make a difference.
>
> While at it, drop the "addr" parameter that is unused.
>
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Link: https://lore.kernel.org/linux-mm/20250503182858.5a02729fcffd6d4723afcfc2@linux-foundation.org/
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/internal.h  | 11 ++++++++---
>  mm/madvise.c   |  4 ++--
>  mm/memory.c    |  8 +++-----
>  mm/mempolicy.c |  3 +--
>  mm/mlock.c     |  3 +--
>  mm/mremap.c    |  3 +--
>  mm/rmap.c      |  3 +--
>  mm/util.c      | 29 +++++++++++++++++++++++++++++
>  8 files changed, 46 insertions(+), 18 deletions(-)
>
Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi



* Re: [PATCH v2 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-07-02 14:00   ` Oscar Salvador
@ 2025-07-02 14:40     ` David Hildenbrand
  2025-07-02 17:53       ` Oscar Salvador
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-07-02 14:40 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang

On 02.07.25 16:00, Oscar Salvador wrote:
> On Wed, Jul 02, 2025 at 12:49:26PM +0200, David Hildenbrand wrote:
>> Instead, let's just allow for specifying through flags whether we want
>> to have bits merged into the original PTE.
>>
>> For the madvise() case, simplify by having only a single parameter for
>> merging young+dirty. For madvise_cold_or_pageout_pte_range(), merging the
>> dirty bit is not required, but also not harmful. After all, this code is not
>> performance critical enough to force every micro-optimization.
> 
> It seems a bit odd to me to have the dirty and young bits being treated as
> "one". You mention that this is done because the only user of it doesn't
> really care about dirty vs. non-dirty and it's not harmful either way, so
> micro-optimizing this isn't worth it at the moment.
> 
> But what if we grow another user which wants to make this distinction, and
> where dirty vs. non-dirty / young vs. non-young matters?
> Wouldn't it be better to have them separated from the start?

I mean, it's easy to add later if ever required. :)

But more importantly, young+dirty is usually a moving target, as HW can
update it asynchronously.

So in the common case, you really have to rely on the young+dirty bits 
from get_and_clear_full_ptes(), and not on the current snapshot while 
the page remains mapped.

The madvise() use case is rather special in that sense.
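
(Rough sketch of that common case, modeled on the zap path and the existing
get_and_clear_full_ptes() helper; exact parameters depend on the caller:)

	nr = folio_pte_batch(folio, pte, ptent, max_nr);
	ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
	/* young/dirty in ptent now reflect any PTE of the batch, including
	 * HW updates that raced with the batch scan. */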

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-07-02 14:40     ` David Hildenbrand
@ 2025-07-02 17:53       ` Oscar Salvador
  0 siblings, 0 replies; 17+ messages in thread
From: Oscar Salvador @ 2025-07-02 17:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang

On Wed, Jul 02, 2025 at 04:40:30PM +0200, David Hildenbrand wrote:
> But more importantly, young+dirty is usually a moving target, as HW can
> update it asynchronously.

Ok, you mean they often get queried both at once?

> So in the common case, you really have to rely on the young+dirty bits from
> get_and_clear_full_ptes(), and not on the current snapshot while the page
> remains mapped.
> 
> The madvise() use case is rather special in that sense.

I see.
I mean, codewise this looks like an improvement, I was just puzzled by the
fact that we're dealing with young+dirty together (while we didn't before),
but given that get_and_clear_full_ptes() do that already, I guess it makes
sense if that's the way we usually operate.

I wasn't familiar with that; thinking about it, it makes sense. But I wonder whether
we could place a comment either in the definition of FPB_MERGE_YOUNG_DIRTY, or in
the handling of it in folio_pte_batch_flags().
I guess in the definition would make more sense.


Whether you decide to move forward on the comment or not (could be
squashed), I'm ok with this. 


-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v2 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-07-02 10:49 ` [PATCH v2 4/4] mm: remove boolean output parameters from folio_pte_batch_ext() David Hildenbrand
  2025-07-02 14:00   ` Oscar Salvador
@ 2025-07-03  9:16   ` Oscar Salvador
  2025-07-03 12:56   ` Dev Jain
  2025-08-04  8:22   ` Wei Yang
  3 siblings, 0 replies; 17+ messages in thread
From: Oscar Salvador @ 2025-07-03  9:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang

On Wed, Jul 02, 2025 at 12:49:26PM +0200, David Hildenbrand wrote:
> Instead, let's just allow for specifying through flags whether we want
> to have bits merged into the original PTE.
> 
> For the madvise() case, simplify by having only a single parameter for
> merging young+dirty. For madvise_cold_or_pageout_pte_range(), merging the
> dirty bit is not required, but also not harmful. After all, this code is not
> performance critical enough to force every micro-optimization.
> 
> As we now have two pte_t * parameters, use PageTable() to make sure we
> are actually given a pointer to a copy of the PTE, not a pointer into
> an actual page table.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Although I think it'd be nice to have a comment pointing out why the dirty and young bits
go together:

Reviewed-by: Oscar Salvador <osalvador@suse.de>

 

-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v2 1/4] mm: convert FPB_IGNORE_* into FPB_RESPECT_*
  2025-07-02 10:49 ` [PATCH v2 1/4] mm: convert FPB_IGNORE_* into FPB_RESPECT_* David Hildenbrand
@ 2025-07-03 10:05   ` Dev Jain
  0 siblings, 0 replies; 17+ messages in thread
From: Dev Jain @ 2025-07-03 10:05 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo, Lance Yang,
	Oscar Salvador, Lance Yang


On 02/07/25 4:19 pm, David Hildenbrand wrote:
> Respecting these PTE bits is the exception, so let's invert the meaning.
>
> With this change, most callers don't have to pass any flags. This is
> a preparation for splitting folio_pte_batch() into a non-inlined
> variant that doesn't consume any flags.
>
> Long-term, we probably want folio_pte_batch() to ignore the most common
> PTE bits (e.g., write/dirty/young/soft-dirty) that are not relevant for
> most page table walkers; uffd-wp and protnone might be bits to consider in
> the future. Only walkers that care about them can opt in to respect them.
>
> No functional change intended.
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   

Reviewed-by: Dev Jain <dev.jain@arm.com>




* Re: [PATCH v2 2/4] mm: smaller folio_pte_batch() improvements
  2025-07-02 10:49 ` [PATCH v2 2/4] mm: smaller folio_pte_batch() improvements David Hildenbrand
  2025-07-02 14:24   ` Zi Yan
@ 2025-07-03 10:18   ` Dev Jain
  1 sibling, 0 replies; 17+ messages in thread
From: Dev Jain @ 2025-07-03 10:18 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo, Lance Yang,
	Oscar Salvador, Lance Yang


On 02/07/25 4:19 pm, David Hildenbrand wrote:
> Let's clean up a bit:
>
> (1) No need for start_ptep vs. ptep anymore, we can simply use ptep.
>
> (2) Let's switch to "unsigned int" for everything. Negative values do
>      not make sense.
>
> (3) We can simplify the code by leaving the pte unchanged after the
>      pte_same() check.
>
> (4) Clarify that we should never exceed a single VMA; it indicates a
>      problem in the caller.
>
> No functional change intended.
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   

Reviewed-by: Dev Jain <dev.jain@arm.com>




* Re: [PATCH v2 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_flags()
  2025-07-02 10:49 ` [PATCH v2 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_flags() David Hildenbrand
  2025-07-02 14:25   ` Zi Yan
@ 2025-07-03 10:45   ` Dev Jain
  1 sibling, 0 replies; 17+ messages in thread
From: Dev Jain @ 2025-07-03 10:45 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo, Lance Yang,
	Oscar Salvador


On 02/07/25 4:19 pm, David Hildenbrand wrote:
> Many users (including upcoming ones) don't really need the flags etc,
> and can live with the possible overhead of a function call.
>
> So let's provide a basic, non-inlined folio_pte_batch(), to avoid code
> bloat while still providing a variant that optimizes out all flag
> checks at runtime. folio_pte_batch_flags() will get inlined into
> folio_pte_batch(), optimizing out any conditionals that depend on input
> flags.
>
> folio_pte_batch() will behave like folio_pte_batch_flags() when no
> flags are specified. It's okay to add new users of
> folio_pte_batch_flags(), but using folio_pte_batch() if applicable is
> preferred.
>
> So, before this change, folio_pte_batch() was inlined into each C file and
> optimized by propagating constants within the resulting object file.
>
> With this change, we now also have a folio_pte_batch() that is
> optimized by propagating all constants. But instead of having one instance
> per object file, we have a single shared one.
>
> In zap_present_ptes(), where we care about performance, the compiler
> already seems to generate a call to a common inlined folio_pte_batch()
> variant, shared with fork() code. So calling the new non-inlined variant
> should not make a difference.
>
> While at it, drop the "addr" parameter that is unused.
>
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Link: https://lore.kernel.org/linux-mm/20250503182858.5a02729fcffd6d4723afcfc2@linux-foundation.org/
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   

Having one place to share the inlined copy is amazing! In the near
future I will be doing some GNU toolchain work so hopefully I gain
knowledge on compiler stuff; it was fun reading the inlined
vs non-inlined discussion.

Reviewed-by: Dev Jain <dev.jain@arm.com>




* Re: [PATCH v2 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-07-02 10:49 ` [PATCH v2 4/4] mm: remove boolean output parameters from folio_pte_batch_ext() David Hildenbrand
  2025-07-02 14:00   ` Oscar Salvador
  2025-07-03  9:16   ` Oscar Salvador
@ 2025-07-03 12:56   ` Dev Jain
  2025-08-04  8:22   ` Wei Yang
  3 siblings, 0 replies; 17+ messages in thread
From: Dev Jain @ 2025-07-03 12:56 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Pedro Falcato, Rik van Riel, Harry Yoo, Lance Yang,
	Oscar Salvador


On 02/07/25 4:19 pm, David Hildenbrand wrote:
> Instead, let's just allow for specifying through flags whether we want
> to have bits merged into the original PTE.
>
> For the madvise() case, simplify by having only a single parameter for
> merging young+dirty. For madvise_cold_or_pageout_pte_range(), merging the
> dirty bit is not required, but also not harmful. After all, this code is not
> performance critical enough to force every micro-optimization.

Makes sense.

>
> As we now have two pte_t * parameters, use PageTable() to make sure we
> are actually given a pointer to a copy of the PTE, not a pointer into
> an actual page table.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   

Reviewed-by: Dev Jain <dev.jain@arm.com>




* Re: [PATCH v2 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-07-02 10:49 ` [PATCH v2 4/4] mm: remove boolean output parameters from folio_pte_batch_ext() David Hildenbrand
                     ` (2 preceding siblings ...)
  2025-07-03 12:56   ` Dev Jain
@ 2025-08-04  8:22   ` Wei Yang
  2025-08-04  8:23     ` David Hildenbrand
  3 siblings, 1 reply; 17+ messages in thread
From: Wei Yang @ 2025-08-04  8:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang, Oscar Salvador

Nit in subject.

We have renamed the function to folio_pte_batch_flags().

Not sure if it is too late.


-- 
Wei Yang
Help you, Help me



* Re: [PATCH v2 4/4] mm: remove boolean output parameters from folio_pte_batch_ext()
  2025-08-04  8:22   ` Wei Yang
@ 2025-08-04  8:23     ` David Hildenbrand
  0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-08-04  8:23 UTC (permalink / raw)
  To: Wei Yang
  Cc: linux-kernel, linux-mm, Andrew Morton, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Pedro Falcato, Rik van Riel, Harry Yoo,
	Lance Yang, Oscar Salvador

On 04.08.25 10:22, Wei Yang wrote:
> Nit in subject.
> 
> We have renamed the function to folio_pte_batch_flags().

Right.

> 
> Not sure it is too late.

Already upstream.

-- 
Cheers,

David / dhildenb



