dri-devel.lists.freedesktop.org archive mirror
* [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT
@ 2025-04-25  8:17 David Hildenbrand
  2025-04-25  8:17 ` [PATCH v1 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode() David Hildenbrand
                   ` (11 more replies)
  0 siblings, 12 replies; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25  8:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
	David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu

On top of mm-unstable.

VM_PAT annoyed me too much and wasted too much of my time; let's clean
PAT handling up and remove VM_PAT.

This should sort out various issues with VM_PAT we discovered recently,
and will hopefully make the whole code more stable and easier to maintain.

In essence: we stop letting PAT mode mess with VMAs and instead lift
what to track/untrack to the MM core. We remember per VMA which pfn range
we tracked in a new struct we attach to a VMA (we have space without
exceeding 192 bytes), use a kref to share it among VMAs during
split/mremap/fork, and automatically untrack once the kref drops to 0.
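
Conceptually, the per-VMA tracking context looks something like this
(a rough sketch; the real structure and helpers are introduced in
patch #5):

	struct pfnmap_track_ctx {
		struct kref kref;	/* shared by all VMAs covering the range */
		unsigned long pfn;	/* start of the tracked pfn range */
		unsigned long size;	/* size of the tracked pfn range */
	};

	/* The last kref_put() ends up untracking the whole range. */
	void pfnmap_track_ctx_release(struct kref *ref)
	{
		struct pfnmap_track_ctx *ctx =
			container_of(ref, struct pfnmap_track_ctx, kref);

		pfnmap_untrack(ctx->pfn, ctx->size);
		kfree(ctx);
	}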

This implies that we'll keep tracking a full pfn range even after partially
unmapping it, until fully unmapping it; but as that case was mostly broken
before, this at least makes it work in a way that is least intrusive to
VMA handling.

Shrinking with mremap() used to work in a hacky way; now we'll similarly
keep the original pfn range tracked even after this form of partial unmap.
Does anybody care about that? Unlikely. If we run into issues, we could
likely handle that (adjust the tracking) when our kref drops to 1 while
freeing a VMA. But that adds more complexity, so let's avoid it for now.

There will be some clash with [1], but nothing that cannot be sorted out
easily by moving the functions added to kernel/fork.c to wherever the vma
bits will live.

Briefly tested with some basic /dev/mem tests I crafted. I want to convert
them to selftests, but that might or might not require a bit more work
(e.g., /dev/mem accessibility).

[1] lkml.kernel.org/r/cover.1745528282.git.lorenzo.stoakes@oracle.com

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Peter Xu <peterx@redhat.com>

David Hildenbrand (11):
  x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode()
  mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
  x86/mm/pat: introduce pfnmap_track() and pfnmap_untrack()
  mm/memremap: convert to pfnmap_track() + pfnmap_untrack()
  mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  x86/mm/pat: remove old pfnmap tracking interface
  mm: remove VM_PAT
  x86/mm/pat: remove strict_prot parameter from reserve_pfn_range()
  x86/mm/pat: remove MEMTYPE_*_MATCH
  drm/i915: track_pfn() -> "pfnmap tracking"
  mm/io-mapping: track_pfn() -> "pfnmap tracking"

 arch/x86/mm/pat/memtype.c          | 194 ++++-------------------------
 arch/x86/mm/pat/memtype_interval.c |  44 +------
 drivers/gpu/drm/i915/i915_mm.c     |   4 +-
 include/linux/mm.h                 |   4 +-
 include/linux/mm_inline.h          |   2 +
 include/linux/mm_types.h           |  11 ++
 include/linux/pgtable.h            | 101 ++++++---------
 include/trace/events/mmflags.h     |   4 +-
 kernel/fork.c                      |  54 +++++++-
 mm/huge_memory.c                   |   7 +-
 mm/io-mapping.c                    |   2 +-
 mm/memory.c                        |  85 ++++++++++---
 mm/memremap.c                      |   8 +-
 mm/mremap.c                        |   4 -
 14 files changed, 212 insertions(+), 312 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v1 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode()
  2025-04-25  8:17 [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
@ 2025-04-25  8:17 ` David Hildenbrand
  2025-04-28 16:16   ` Lorenzo Stoakes
  2025-04-25  8:17 ` [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot() David Hildenbrand
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25  8:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
	David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu

Let's factor it out to make the code easier to grasp.

Also use it in pgprot_writecombine()/pgprot_writethrough(), where
clearing the old cachemode might not strictly be required; but since we
are already doing a function call, there is no need to worry about this
micro-optimization.
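
To illustrate the difference (a sketch based on the hunks below): the
old pgprot_writecombine() merely ORed in the new cachemode bits,
relying on the old ones being clear, whereas the helper first clears
_PAGE_CACHE_MASK:

	/* old: OR in the WC bits without clearing the old cachemode */
	prot = __pgprot(pgprot_val(prot) |
			cachemode2protval(_PAGE_CACHE_MODE_WC));

	/* new: clear _PAGE_CACHE_MASK first, then set the cachemode */
	pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WC);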

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/pat/memtype.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 72d8cbc611583..edec5859651d6 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -800,6 +800,12 @@ static inline int range_is_allowed(unsigned long pfn, unsigned long size)
 }
 #endif /* CONFIG_STRICT_DEVMEM */
 
+static inline void pgprot_set_cachemode(pgprot_t *prot, enum page_cache_mode pcm)
+{
+	*prot = __pgprot((pgprot_val(*prot) & ~_PAGE_CACHE_MASK) |
+			 cachemode2protval(pcm));
+}
+
 int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
 				unsigned long size, pgprot_t *vma_prot)
 {
@@ -811,8 +817,7 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
 	if (file->f_flags & O_DSYNC)
 		pcm = _PAGE_CACHE_MODE_UC_MINUS;
 
-	*vma_prot = __pgprot((pgprot_val(*vma_prot) & ~_PAGE_CACHE_MASK) |
-			     cachemode2protval(pcm));
+	pgprot_set_cachemode(vma_prot, pcm);
 	return 1;
 }
 
@@ -880,9 +885,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
 				(unsigned long long)paddr,
 				(unsigned long long)(paddr + size - 1),
 				cattr_name(pcm));
-			*vma_prot = __pgprot((pgprot_val(*vma_prot) &
-					     (~_PAGE_CACHE_MASK)) |
-					     cachemode2protval(pcm));
+			pgprot_set_cachemode(vma_prot, pcm);
 		}
 		return 0;
 	}
@@ -907,9 +910,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
 		 * We allow returning different type than the one requested in
 		 * non strict case.
 		 */
-		*vma_prot = __pgprot((pgprot_val(*vma_prot) &
-				      (~_PAGE_CACHE_MASK)) |
-				     cachemode2protval(pcm));
+		pgprot_set_cachemode(vma_prot, pcm);
 	}
 
 	if (memtype_kernel_map_sync(paddr, size, pcm) < 0) {
@@ -1060,9 +1061,7 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 			return -EINVAL;
 	}
 
-	*prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
-			 cachemode2protval(pcm));
-
+	pgprot_set_cachemode(prot, pcm);
 	return 0;
 }
 
@@ -1073,10 +1072,8 @@ void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
 	if (!pat_enabled())
 		return;
 
-	/* Set prot based on lookup */
 	pcm = lookup_memtype(pfn_t_to_phys(pfn));
-	*prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
-			 cachemode2protval(pcm));
+	pgprot_set_cachemode(prot, pcm);
 }
 
 /*
@@ -1115,15 +1112,15 @@ void untrack_pfn_clear(struct vm_area_struct *vma)
 
 pgprot_t pgprot_writecombine(pgprot_t prot)
 {
-	return __pgprot(pgprot_val(prot) |
-				cachemode2protval(_PAGE_CACHE_MODE_WC));
+	pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WC);
+	return prot;
 }
 EXPORT_SYMBOL_GPL(pgprot_writecombine);
 
 pgprot_t pgprot_writethrough(pgprot_t prot)
 {
-	return __pgprot(pgprot_val(prot) |
-				cachemode2protval(_PAGE_CACHE_MODE_WT));
+	pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WT);
+	return prot;
 }
 EXPORT_SYMBOL_GPL(pgprot_writethrough);
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
  2025-04-25  8:17 [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
  2025-04-25  8:17 ` [PATCH v1 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode() David Hildenbrand
@ 2025-04-25  8:17 ` David Hildenbrand
  2025-04-25 19:31   ` Peter Xu
  2025-04-25  8:17 ` [PATCH v1 03/11] x86/mm/pat: introduce pfnmap_track() and pfnmap_untrack() David Hildenbrand
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25  8:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
	David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu

... by factoring it out from track_pfn_remap().

For PMDs/PUDs, actually check the full range, and trigger a fallback
if we run into this "different memory types / cachemodes" scenario.

Add some documentation.

Will checking each page result in undesired overhead? We'll have to
learn. Not checking each page looks wrong, though. Maybe we could
optimize the lookup internally.
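
For reference, the check boils down to something like this (simplified
sketch of the loop that ends up in pfnmap_sanitize_pgprot()):

	pcm = lookup_memtype(paddr);

	/* All remaining pages must use the same memtype/cachemode. */
	while (size > PAGE_SIZE) {
		size -= PAGE_SIZE;
		paddr += PAGE_SIZE;
		if (pcm != lookup_memtype(paddr))
			return -EINVAL;
	}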

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/pat/memtype.c | 24 ++++++++----------------
 include/linux/pgtable.h   | 28 ++++++++++++++++++++--------
 mm/huge_memory.c          |  7 +++++--
 mm/memory.c               |  4 ++--
 4 files changed, 35 insertions(+), 28 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index edec5859651d6..193e33251b18f 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -1031,7 +1031,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 		    unsigned long pfn, unsigned long addr, unsigned long size)
 {
 	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
-	enum page_cache_mode pcm;
 
 	/* reserve the whole chunk starting from paddr */
 	if (!vma || (addr == vma->vm_start
@@ -1044,13 +1043,17 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 		return ret;
 	}
 
+	return pfnmap_sanitize_pgprot(pfn, size, prot);
+}
+
+int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size, pgprot_t *prot)
+{
+	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
+	enum page_cache_mode pcm;
+
 	if (!pat_enabled())
 		return 0;
 
-	/*
-	 * For anything smaller than the vma size we set prot based on the
-	 * lookup.
-	 */
 	pcm = lookup_memtype(paddr);
 
 	/* Check memtype for the remaining pages */
@@ -1065,17 +1068,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 	return 0;
 }
 
-void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
-{
-	enum page_cache_mode pcm;
-
-	if (!pat_enabled())
-		return;
-
-	pcm = lookup_memtype(pfn_t_to_phys(pfn));
-	pgprot_set_cachemode(prot, pcm);
-}
-
 /*
  * untrack_pfn is called while unmapping a pfnmap for a region.
  * untrack can be called for a specific region indicated by pfn and size or
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b50447ef1c921..91aadfe2515a5 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1500,13 +1500,10 @@ static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 	return 0;
 }
 
-/*
- * track_pfn_insert is called when a _new_ single pfn is established
- * by vmf_insert_pfn().
- */
-static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
-				    pfn_t pfn)
+static inline int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
+		pgprot_t *prot)
 {
+	return 0;
 }
 
 /*
@@ -1556,8 +1553,23 @@ static inline void untrack_pfn_clear(struct vm_area_struct *vma)
 extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 			   unsigned long pfn, unsigned long addr,
 			   unsigned long size);
-extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
-			     pfn_t pfn);
+
+/**
+ * pfnmap_sanitize_pgprot - sanitize the pgprot for a pfn range
+ * @pfn: the start of the pfn range
+ * @size: the size of the pfn range
+ * @prot: the pgprot to sanitize
+ *
+ * Sanitize the given pgprot for a pfn range, for example, adjusting the
+ * cachemode.
+ *
+ * This function cannot fail for a single page, but can fail for multiple
+ * pages.
+ *
+ * Returns 0 on success and -EINVAL on error.
+ */
+int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
+		pgprot_t *prot);
 extern int track_pfn_copy(struct vm_area_struct *dst_vma,
 		struct vm_area_struct *src_vma, unsigned long *pfn);
 extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fdcf0a6049b9f..b8ae5e1493315 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
 			return VM_FAULT_OOM;
 	}
 
-	track_pfn_insert(vma, &pgprot, pfn);
+	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), HPAGE_PMD_SIZE, &pgprot))
+		return VM_FAULT_FALLBACK;
+
 	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
 	error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write,
 			pgtable);
@@ -1577,7 +1579,8 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
 
-	track_pfn_insert(vma, &pgprot, pfn);
+	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), HPAGE_PUD_SIZE, &pgprot))
+		return VM_FAULT_FALLBACK;
 
 	ptl = pud_lock(vma->vm_mm, vmf->pud);
 	insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
diff --git a/mm/memory.c b/mm/memory.c
index 424420349bd3c..c737a8625866a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2563,7 +2563,7 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 	if (!pfn_modify_allowed(pfn, pgprot))
 		return VM_FAULT_SIGBUS;
 
-	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
+	pfnmap_sanitize_pgprot(pfn, PAGE_SIZE, &pgprot);
 
 	return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
 			false);
@@ -2626,7 +2626,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
 
-	track_pfn_insert(vma, &pgprot, pfn);
+	pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot);
 
 	if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
 		return VM_FAULT_SIGBUS;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v1 03/11] x86/mm/pat: introduce pfnmap_track() and pfnmap_untrack()
  2025-04-25  8:17 [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
  2025-04-25  8:17 ` [PATCH v1 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode() David Hildenbrand
  2025-04-25  8:17 ` [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot() David Hildenbrand
@ 2025-04-25  8:17 ` David Hildenbrand
  2025-04-28 16:53   ` Lorenzo Stoakes
  2025-04-25  8:17 ` [PATCH v1 04/11] mm/memremap: convert to pfnmap_track() + pfnmap_untrack() David Hildenbrand
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25  8:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
	David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu

Let's provide variants of track_pfn_remap() and untrack_pfn() that won't
mess with VMAs, to replace the existing interface step-by-step.

Add some documentation.
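
The expected calling pattern is roughly the following sketch, where
do_the_actual_mapping() is just a stand-in for the mapping code (patch
#5 wires this up for remap_pfn_range()):

	pgprot_t prot = vma->vm_page_prot;
	int err;

	if (pfnmap_track(pfn, size, &prot))	/* reserve + sanitize pgprot */
		return -EINVAL;

	err = do_the_actual_mapping(vma, addr, pfn, size, prot);
	if (err)
		pfnmap_untrack(pfn, size);	/* undo the reservation */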

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/pat/memtype.c | 14 ++++++++++++++
 include/linux/pgtable.h   | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 193e33251b18f..c011d8dd8f441 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -1068,6 +1068,20 @@ int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size, pgprot_t *prot
 	return 0;
 }
 
+int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot)
+{
+	const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
+
+	return reserve_pfn_range(paddr, size, prot, 0);
+}
+
+void pfnmap_untrack(unsigned long pfn, unsigned long size)
+{
+	const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
+
+	free_pfn_range(paddr, size);
+}
+
 /*
  * untrack_pfn is called while unmapping a pfnmap for a region.
  * untrack can be called for a specific region indicated by pfn and size or
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 91aadfe2515a5..898a3ab195578 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1506,6 +1506,16 @@ static inline int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
 	return 0;
 }
 
+static inline int pfnmap_track(unsigned long pfn, unsigned long size,
+		pgprot_t *prot)
+{
+	return 0;
+}
+
+static inline void pfnmap_untrack(unsigned long pfn, unsigned long size)
+{
+}
+
 /*
  * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
  * tables copied during copy_page_range(). Will store the pfn to be
@@ -1570,6 +1580,29 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
  */
 int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
 		pgprot_t *prot);
+
+/**
+ * pfnmap_track - track a pfn range
+ * @pfn: the start of the pfn range
+ * @size: the size of the pfn range
+ * @prot: the pgprot to track
+ *
+ * Tracking a pfnmap range involves conditionally reserving a pfn range and
+ * sanitizing the pgprot -- see pfnmap_sanitize_pgprot().
+ *
+ * Returns 0 on success and -EINVAL on error.
+ */
+int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot);
+
+/**
+ * pfnmap_untrack - untrack a pfn range
+ * @pfn: the start of the pfn range
+ * @size: the size of the pfn range
+ *
+ * Untrack a pfn range previously tracked through pfnmap_track(), for example,
+ * un-doing any reservation.
+ */
+void pfnmap_untrack(unsigned long pfn, unsigned long size);
 extern int track_pfn_copy(struct vm_area_struct *dst_vma,
 		struct vm_area_struct *src_vma, unsigned long *pfn);
 extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v1 04/11] mm/memremap: convert to pfnmap_track() + pfnmap_untrack()
  2025-04-25  8:17 [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
                   ` (2 preceding siblings ...)
  2025-04-25  8:17 ` [PATCH v1 03/11] x86/mm/pat: introduce pfnmap_track() and pfnmap_untrack() David Hildenbrand
@ 2025-04-25  8:17 ` David Hildenbrand
  2025-04-25 20:00   ` Peter Xu
  2025-04-25  8:17 ` [PATCH v1 05/11] mm: convert VM_PFNMAP tracking " David Hildenbrand
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25  8:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
	David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu

Let's use the new, cleaner interface.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memremap.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 2aebc1b192da9..c417c843e9b1f 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -130,7 +130,7 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
 	}
 	mem_hotplug_done();
 
-	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
+	pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
 	pgmap_array_delete(range);
 }
 
@@ -211,8 +211,8 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	if (nid < 0)
 		nid = numa_mem_id();
 
-	error = track_pfn_remap(NULL, &params->pgprot, PHYS_PFN(range->start), 0,
-			range_len(range));
+	error = pfnmap_track(PHYS_PFN(range->start), range_len(range),
+			     &params->pgprot);
 	if (error)
 		goto err_pfn_remap;
 
@@ -277,7 +277,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	if (!is_private)
 		kasan_remove_zero_shadow(__va(range->start), range_len(range));
 err_kasan:
-	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
+	pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
 err_pfn_remap:
 	pgmap_array_delete(range);
 	return error;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-25  8:17 [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
                   ` (3 preceding siblings ...)
  2025-04-25  8:17 ` [PATCH v1 04/11] mm/memremap: convert to pfnmap_track() + pfnmap_untrack() David Hildenbrand
@ 2025-04-25  8:17 ` David Hildenbrand
  2025-04-25 20:23   ` Peter Xu
                     ` (2 more replies)
  2025-04-25  8:17 ` [PATCH v1 06/11] x86/mm/pat: remove old pfnmap tracking interface David Hildenbrand
                   ` (6 subsequent siblings)
  11 siblings, 3 replies; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25  8:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
	David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu

Let's use our new interface. In remap_pfn_range(), we'll now decide
whether we have to track (full VMA covered) or only sanitize the pgprot
(partial VMA covered).

Remember what we have to untrack by linking it from the VMA. When
duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it
similarly to anon VMA names, and use a kref to share the tracking.

Once the last VMA un-refs our tracking data, we'll do the untracking,
which simplifies things a lot and should sort out the various issues we
saw recently, for example, when partially unmapping/zapping a tracked
VMA.

This change implies that we'll keep tracking the original PFN range even
after splitting + partially unmapping it: not too bad, because it was
not working reliably before. The only thing that kind-of worked before
was shrinking such a mapping using mremap(): we managed to adjust the
reservation in a hacky way; now we won't adjust the reservation but
leave it around until all involved VMAs are gone.
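
Assuming remap_pfn_range() covered the full VMA, the resulting
lifecycle looks like this:

	remap_pfn_range()	-> ctx allocated, kref == 1
	split/mremap/fork	-> vm_area_dup(): kref_get(), kref == 2
	unmap one VMA		-> vm_area_free(): kref_put(), kref == 1
	unmap last VMA		-> kref_put() drops to 0: pfnmap_untrack()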

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm_inline.h |  2 +
 include/linux/mm_types.h  | 11 ++++++
 kernel/fork.c             | 54 ++++++++++++++++++++++++--
 mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
 mm/mremap.c               |  4 --
 5 files changed, 128 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index f9157a0c42a5c..89b518ff097e6 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
 
 #endif  /* CONFIG_ANON_VMA_NAME */
 
+void pfnmap_track_ctx_release(struct kref *ref);
+
 static inline void init_tlb_flush_pending(struct mm_struct *mm)
 {
 	atomic_set(&mm->tlb_flush_pending, 0);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f91..91124761cfda8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -764,6 +764,14 @@ struct vma_numab_state {
 	int prev_scan_seq;
 };
 
+#ifdef __HAVE_PFNMAP_TRACKING
+struct pfnmap_track_ctx {
+	struct kref kref;
+	unsigned long pfn;
+	unsigned long size;
+};
+#endif
+
 /*
  * This struct describes a virtual memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
@@ -877,6 +885,9 @@ struct vm_area_struct {
 	struct anon_vma_name *anon_name;
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef __HAVE_PFNMAP_TRACKING
+	struct pfnmap_track_ctx *pfnmap_track_ctx;
+#endif
 } __randomize_layout;
 
 #ifdef CONFIG_NUMA
diff --git a/kernel/fork.c b/kernel/fork.c
index 168681fc4b25a..ae518b8fe752c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -481,7 +481,51 @@ static void vm_area_init_from(const struct vm_area_struct *src,
 #ifdef CONFIG_NUMA
 	dest->vm_policy = src->vm_policy;
 #endif
+#ifdef __HAVE_PFNMAP_TRACKING
+	dest->pfnmap_track_ctx = NULL;
+#endif
+}
+
+#ifdef __HAVE_PFNMAP_TRACKING
+static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
+		struct vm_area_struct *new)
+{
+	struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx;
+
+	if (likely(!ctx))
+		return 0;
+
+	/*
+	 * We don't expect to ever hit this. If ever required, we would have
+	 * to duplicate the tracking.
+	 */
+	if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX))
+		return -ENOMEM;
+	kref_get(&ctx->kref);
+	new->pfnmap_track_ctx = ctx;
+	return 0;
+}
+
+static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
+{
+	struct pfnmap_track_ctx *ctx = vma->pfnmap_track_ctx;
+
+	if (likely(!ctx))
+		return;
+
+	kref_put(&ctx->kref, pfnmap_track_ctx_release);
+	vma->pfnmap_track_ctx = NULL;
+}
+#else
+static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
+		struct vm_area_struct *new)
+{
+	return 0;
 }
+static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
+{
+}
+#endif
 
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 {
@@ -493,6 +537,11 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
 	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
 	vm_area_init_from(orig, new);
+
+	if (vma_pfnmap_track_ctx_dup(orig, new)) {
+		kmem_cache_free(vm_area_cachep, new);
+		return NULL;
+	}
 	vma_lock_init(new, true);
 	INIT_LIST_HEAD(&new->anon_vma_chain);
 	vma_numab_state_init(new);
@@ -507,6 +556,7 @@ void vm_area_free(struct vm_area_struct *vma)
 	vma_assert_detached(vma);
 	vma_numab_state_free(vma);
 	free_anon_vma_name(vma);
+	vma_pfnmap_track_ctx_release(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
@@ -669,10 +719,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		if (!tmp)
 			goto fail_nomem;
 
-		/* track_pfn_copy() will later take care of copying internal state. */
-		if (unlikely(tmp->vm_flags & VM_PFNMAP))
-			untrack_pfn_clear(tmp);
-
 		retval = vma_dup_policy(mpnt, tmp);
 		if (retval)
 			goto fail_nomem_policy;
diff --git a/mm/memory.c b/mm/memory.c
index c737a8625866a..eb2b3f10a97ec 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1370,7 +1370,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	struct mm_struct *dst_mm = dst_vma->vm_mm;
 	struct mm_struct *src_mm = src_vma->vm_mm;
 	struct mmu_notifier_range range;
-	unsigned long next, pfn = 0;
+	unsigned long next;
 	bool is_cow;
 	int ret;
 
@@ -1380,12 +1380,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	if (is_vm_hugetlb_page(src_vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
 
-	if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
-		ret = track_pfn_copy(dst_vma, src_vma, &pfn);
-		if (ret)
-			return ret;
-	}
-
 	/*
 	 * We need to invalidate the secondary MMU mappings only when
 	 * there could be a permission downgrade on the ptes of the
@@ -1427,8 +1421,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		raw_write_seqcount_end(&src_mm->write_protect_seq);
 		mmu_notifier_invalidate_range_end(&range);
 	}
-	if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
-		untrack_pfn_copy(dst_vma, pfn);
 	return ret;
 }
 
@@ -1923,9 +1915,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 	if (vma->vm_file)
 		uprobe_munmap(vma, start, end);
 
-	if (unlikely(vma->vm_flags & VM_PFNMAP))
-		untrack_pfn(vma, 0, 0, mm_wr_locked);
-
 	if (start != end) {
 		if (unlikely(is_vm_hugetlb_page(vma))) {
 			/*
@@ -2871,6 +2860,36 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
 	return error;
 }
 
+#ifdef __HAVE_PFNMAP_TRACKING
+static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
+		unsigned long size, pgprot_t *prot)
+{
+	struct pfnmap_track_ctx *ctx;
+
+	if (pfnmap_track(pfn, size, prot))
+		return ERR_PTR(-EINVAL);
+
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	if (unlikely(!ctx)) {
+		pfnmap_untrack(pfn, size);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	ctx->pfn = pfn;
+	ctx->size = size;
+	kref_init(&ctx->kref);
+	return ctx;
+}
+
+void pfnmap_track_ctx_release(struct kref *ref)
+{
+	struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref);
+
+	pfnmap_untrack(ctx->pfn, ctx->size);
+	kfree(ctx);
+}
+#endif /* __HAVE_PFNMAP_TRACKING */
+
 /**
  * remap_pfn_range - remap kernel memory to userspace
  * @vma: user vma to map to
@@ -2883,20 +2902,50 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
  *
  * Return: %0 on success, negative error code otherwise.
  */
+#ifdef __HAVE_PFNMAP_TRACKING
 int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 		    unsigned long pfn, unsigned long size, pgprot_t prot)
 {
+	struct pfnmap_track_ctx *ctx = NULL;
 	int err;
 
-	err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
-	if (err)
+	size = PAGE_ALIGN(size);
+
+	/*
+	 * If we cover the full VMA, we'll perform actual tracking, and
+	 * remember to untrack when the last reference to our tracking
+	 * context from a VMA goes away.
+	 *
+	 * If we only cover parts of the VMA, we'll only sanitize the
+	 * pgprot.
+	 */
+	if (addr == vma->vm_start && addr + size == vma->vm_end) {
+		if (vma->pfnmap_track_ctx)
+			return -EINVAL;
+		ctx = pfnmap_track_ctx_alloc(pfn, size, &prot);
+		if (IS_ERR(ctx))
+			return PTR_ERR(ctx);
+	} else if (pfnmap_sanitize_pgprot(pfn, size, &prot)) {
 		return -EINVAL;
+	}
 
 	err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
-	if (err)
-		untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
+	if (ctx) {
+		if (err)
+			kref_put(&ctx->kref, pfnmap_track_ctx_release);
+		else
+			vma->pfnmap_track_ctx = ctx;
+	}
 	return err;
 }
+
+#else
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
+		    unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
+}
+#endif
 EXPORT_SYMBOL(remap_pfn_range);
 
 /**
diff --git a/mm/mremap.c b/mm/mremap.c
index 7db9da609c84f..6e78e02f74bd3 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1191,10 +1191,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
 	if (is_vm_hugetlb_page(vma))
 		clear_vma_resv_huge_pages(vma);
 
-	/* Tell pfnmap has moved from this vma */
-	if (unlikely(vma->vm_flags & VM_PFNMAP))
-		untrack_pfn_clear(vma);
-
 	*new_vma_ptr = new_vma;
 	return err;
 }
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v1 06/11] x86/mm/pat: remove old pfnmap tracking interface
  2025-04-25  8:17 [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
                   ` (4 preceding siblings ...)
  2025-04-25  8:17 ` [PATCH v1 05/11] mm: convert VM_PFNMAP tracking " David Hildenbrand
@ 2025-04-25  8:17 ` David Hildenbrand
  2025-04-28 20:12   ` Lorenzo Stoakes
  2025-04-25  8:17 ` [PATCH v1 07/11] mm: remove VM_PAT David Hildenbrand
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25  8:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
	David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu

We can now get rid of the old interface along with get_pat_info() and
follow_phys().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/pat/memtype.c | 147 --------------------------------------
 include/linux/pgtable.h   |  66 -----------------
 2 files changed, 213 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index c011d8dd8f441..668ebf0065157 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -933,119 +933,6 @@ static void free_pfn_range(u64 paddr, unsigned long size)
 		memtype_free(paddr, paddr + size);
 }
 
-static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
-		resource_size_t *phys)
-{
-	struct follow_pfnmap_args args = { .vma = vma, .address = vma->vm_start };
-
-	if (follow_pfnmap_start(&args))
-		return -EINVAL;
-
-	/* Never return PFNs of anon folios in COW mappings. */
-	if (!args.special) {
-		follow_pfnmap_end(&args);
-		return -EINVAL;
-	}
-
-	*prot = pgprot_val(args.pgprot);
-	*phys = (resource_size_t)args.pfn << PAGE_SHIFT;
-	follow_pfnmap_end(&args);
-	return 0;
-}
-
-static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
-		pgprot_t *pgprot)
-{
-	unsigned long prot;
-
-	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_PAT));
-
-	/*
-	 * We need the starting PFN and cachemode used for track_pfn_remap()
-	 * that covered the whole VMA. For most mappings, we can obtain that
-	 * information from the page tables. For COW mappings, we might now
-	 * suddenly have anon folios mapped and follow_phys() will fail.
-	 *
-	 * Fallback to using vma->vm_pgoff, see remap_pfn_range_notrack(), to
-	 * detect the PFN. If we need the cachemode as well, we're out of luck
-	 * for now and have to fail fork().
-	 */
-	if (!follow_phys(vma, &prot, paddr)) {
-		if (pgprot)
-			*pgprot = __pgprot(prot);
-		return 0;
-	}
-	if (is_cow_mapping(vma->vm_flags)) {
-		if (pgprot)
-			return -EINVAL;
-		*paddr = (resource_size_t)vma->vm_pgoff << PAGE_SHIFT;
-		return 0;
-	}
-	WARN_ON_ONCE(1);
-	return -EINVAL;
-}
-
-int track_pfn_copy(struct vm_area_struct *dst_vma,
-		struct vm_area_struct *src_vma, unsigned long *pfn)
-{
-	const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
-	resource_size_t paddr;
-	pgprot_t pgprot;
-	int rc;
-
-	if (!(src_vma->vm_flags & VM_PAT))
-		return 0;
-
-	/*
-	 * Duplicate the PAT information for the dst VMA based on the src
-	 * VMA.
-	 */
-	if (get_pat_info(src_vma, &paddr, &pgprot))
-		return -EINVAL;
-	rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
-	if (rc)
-		return rc;
-
-	/* Reservation for the destination VMA succeeded. */
-	vm_flags_set(dst_vma, VM_PAT);
-	*pfn = PHYS_PFN(paddr);
-	return 0;
-}
-
-void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn)
-{
-	untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
-	/*
-	 * Reservation was freed, any copied page tables will get cleaned
-	 * up later, but without getting PAT involved again.
-	 */
-}
-
-/*
- * prot is passed in as a parameter for the new mapping. If the vma has
- * a linear pfn mapping for the entire range, or no vma is provided,
- * reserve the entire pfn + size range with single reserve_pfn_range
- * call.
- */
-int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
-		    unsigned long pfn, unsigned long addr, unsigned long size)
-{
-	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
-
-	/* reserve the whole chunk starting from paddr */
-	if (!vma || (addr == vma->vm_start
-				&& size == (vma->vm_end - vma->vm_start))) {
-		int ret;
-
-		ret = reserve_pfn_range(paddr, size, prot, 0);
-		if (ret == 0 && vma)
-			vm_flags_set(vma, VM_PAT);
-		return ret;
-	}
-
-	return pfnmap_sanitize_pgprot(pfn, size, prot);
-}
-
 int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size, pgprot_t *prot)
 {
 	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
@@ -1082,40 +969,6 @@ void pfnmap_untrack(unsigned long pfn, unsigned long size)
 	free_pfn_range(paddr, size);
 }
 
-/*
- * untrack_pfn is called while unmapping a pfnmap for a region.
- * untrack can be called for a specific region indicated by pfn and size or
- * can be for the entire vma (in which case pfn, size are zero).
- */
-void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
-		 unsigned long size, bool mm_wr_locked)
-{
-	resource_size_t paddr;
-
-	if (vma && !(vma->vm_flags & VM_PAT))
-		return;
-
-	/* free the chunk starting from pfn or the whole chunk */
-	paddr = (resource_size_t)pfn << PAGE_SHIFT;
-	if (!paddr && !size) {
-		if (get_pat_info(vma, &paddr, NULL))
-			return;
-		size = vma->vm_end - vma->vm_start;
-	}
-	free_pfn_range(paddr, size);
-	if (vma) {
-		if (mm_wr_locked)
-			vm_flags_clear(vma, VM_PAT);
-		else
-			__vm_flags_mod(vma, 0, VM_PAT);
-	}
-}
-
-void untrack_pfn_clear(struct vm_area_struct *vma)
-{
-	vm_flags_clear(vma, VM_PAT);
-}
-
 pgprot_t pgprot_writecombine(pgprot_t prot)
 {
 	pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WC);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 898a3ab195578..0ffc6b9339182 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1489,17 +1489,6 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
  * vmf_insert_pfn.
  */
 
-/*
- * track_pfn_remap is called when a _new_ pfn mapping is being established
- * by remap_pfn_range() for physical range indicated by pfn and size.
- */
-static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
-				  unsigned long pfn, unsigned long addr,
-				  unsigned long size)
-{
-	return 0;
-}
-
 static inline int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
 		pgprot_t *prot)
 {
@@ -1515,55 +1504,7 @@ static inline int pfnmap_track(unsigned long pfn, unsigned long size,
 static inline void pfnmap_untrack(unsigned long pfn, unsigned long size)
 {
 }
-
-/*
- * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
- * tables copied during copy_page_range(). Will store the pfn to be
- * passed to untrack_pfn_copy() only if there is something to be untracked.
- * Callers should initialize the pfn to 0.
- */
-static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
-		struct vm_area_struct *src_vma, unsigned long *pfn)
-{
-	return 0;
-}
-
-/*
- * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
- * copy_page_range(), but after track_pfn_copy() was already called. Can
- * be called even if track_pfn_copy() did not actually track anything:
- * handled internally.
- */
-static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
-		unsigned long pfn)
-{
-}
-
-/*
- * untrack_pfn is called while unmapping a pfnmap for a region.
- * untrack can be called for a specific region indicated by pfn and size or
- * can be for the entire vma (in which case pfn, size are zero).
- */
-static inline void untrack_pfn(struct vm_area_struct *vma,
-			       unsigned long pfn, unsigned long size,
-			       bool mm_wr_locked)
-{
-}
-
-/*
- * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
- *
- * 1) During mremap() on the src VMA after the page tables were moved.
- * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
- */
-static inline void untrack_pfn_clear(struct vm_area_struct *vma)
-{
-}
 #else
-extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
-			   unsigned long pfn, unsigned long addr,
-			   unsigned long size);
-
 /**
  * pfnmap_sanitize_pgprot - sanitize the pgprot for a pfn range
  * @pfn: the start of the pfn range
@@ -1603,13 +1544,6 @@ int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot);
  * un-doing any reservation.
  */
 void pfnmap_untrack(unsigned long pfn, unsigned long size);
-extern int track_pfn_copy(struct vm_area_struct *dst_vma,
-		struct vm_area_struct *src_vma, unsigned long *pfn);
-extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
-		unsigned long pfn);
-extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
-			unsigned long size, bool mm_wr_locked);
-extern void untrack_pfn_clear(struct vm_area_struct *vma);
 #endif
 
 #ifdef CONFIG_MMU
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v1 07/11] mm: remove VM_PAT
  2025-04-25  8:17 [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
                   ` (5 preceding siblings ...)
  2025-04-25  8:17 ` [PATCH v1 06/11] x86/mm/pat: remove old pfnmap tracking interface David Hildenbrand
@ 2025-04-25  8:17 ` David Hildenbrand
  2025-04-28 20:16   ` Lorenzo Stoakes
  2025-04-25  8:17 ` [PATCH v1 08/11] x86/mm/pat: remove strict_prot parameter from reserve_pfn_range() David Hildenbrand
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25  8:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
	David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu

It's unused, so let's remove it.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h             | 4 +---
 include/trace/events/mmflags.h | 4 +---
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9b701cfbef223..a205020e2a58b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -357,9 +357,7 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_SHADOW_STACK	VM_NONE
 #endif
 
-#if defined(CONFIG_X86)
-# define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
-#elif defined(CONFIG_PPC64)
+#if defined(CONFIG_PPC64)
 # define VM_SAO		VM_ARCH_1	/* Strong Access Ordering (powerpc) */
 #elif defined(CONFIG_PARISC)
 # define VM_GROWSUP	VM_ARCH_1
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 15aae955a10bf..aa441f593e9a6 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -172,9 +172,7 @@ IF_HAVE_PG_ARCH_3(arch_3)
 	__def_pageflag_names						\
 	) : "none"
 
-#if defined(CONFIG_X86)
-#define __VM_ARCH_SPECIFIC_1 {VM_PAT,     "pat"           }
-#elif defined(CONFIG_PPC64)
+#if defined(CONFIG_PPC64)
 #define __VM_ARCH_SPECIFIC_1 {VM_SAO,     "sao"           }
 #elif defined(CONFIG_PARISC)
 #define __VM_ARCH_SPECIFIC_1 {VM_GROWSUP,	"growsup"	}
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v1 08/11] x86/mm/pat: remove strict_prot parameter from reserve_pfn_range()
  2025-04-25  8:17 [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
                   ` (6 preceding siblings ...)
  2025-04-25  8:17 ` [PATCH v1 07/11] mm: remove VM_PAT David Hildenbrand
@ 2025-04-25  8:17 ` David Hildenbrand
  2025-04-28 20:18   ` Lorenzo Stoakes
  2025-04-25  8:17 ` [PATCH v1 09/11] x86/mm/pat: remove MEMTYPE_*_MATCH David Hildenbrand
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25  8:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
	David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu

Always set to 0, so let's remove it.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/pat/memtype.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 668ebf0065157..57e3ced4c28cb 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -858,8 +858,7 @@ int memtype_kernel_map_sync(u64 base, unsigned long size,
  * Reserved non RAM regions only and after successful memtype_reserve,
  * this func also keeps identity mapping (if any) in sync with this new prot.
  */
-static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
-				int strict_prot)
+static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot)
 {
 	int is_ram = 0;
 	int ret;
@@ -895,8 +894,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
 		return ret;
 
 	if (pcm != want_pcm) {
-		if (strict_prot ||
-		    !is_new_memtype_allowed(paddr, size, want_pcm, pcm)) {
+		if (!is_new_memtype_allowed(paddr, size, want_pcm, pcm)) {
 			memtype_free(paddr, paddr + size);
 			pr_err("x86/PAT: %s:%d map pfn expected mapping type %s for [mem %#010Lx-%#010Lx], got %s\n",
 			       current->comm, current->pid,
@@ -906,10 +904,6 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
 			       cattr_name(pcm));
 			return -EINVAL;
 		}
-		/*
-		 * We allow returning different type than the one requested in
-		 * non strict case.
-		 */
 		pgprot_set_cachemode(vma_prot, pcm);
 	}
 
@@ -959,7 +953,7 @@ int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot)
 {
 	const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
 
-	return reserve_pfn_range(paddr, size, prot, 0);
+	return reserve_pfn_range(paddr, size, prot);
 }
 
 void pfnmap_untrack(unsigned long pfn, unsigned long size)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v1 09/11] x86/mm/pat: remove MEMTYPE_*_MATCH
  2025-04-25  8:17 [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
                   ` (7 preceding siblings ...)
  2025-04-25  8:17 ` [PATCH v1 08/11] x86/mm/pat: remove strict_prot parameter from reserve_pfn_range() David Hildenbrand
@ 2025-04-25  8:17 ` David Hildenbrand
  2025-04-28 20:23   ` Lorenzo Stoakes
  2025-04-25  8:17 ` [PATCH v1 10/11] drm/i915: track_pfn() -> "pfnmap tracking" David Hildenbrand
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25  8:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
	David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu

The "memramp() shrinking" scenario no longer applies, so let's remove
that now-unnecessary handling.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/pat/memtype_interval.c | 44 ++++--------------------------
 1 file changed, 6 insertions(+), 38 deletions(-)

diff --git a/arch/x86/mm/pat/memtype_interval.c b/arch/x86/mm/pat/memtype_interval.c
index 645613d59942a..9d03f0dbc4715 100644
--- a/arch/x86/mm/pat/memtype_interval.c
+++ b/arch/x86/mm/pat/memtype_interval.c
@@ -49,26 +49,15 @@ INTERVAL_TREE_DEFINE(struct memtype, rb, u64, subtree_max_end,
 
 static struct rb_root_cached memtype_rbroot = RB_ROOT_CACHED;
 
-enum {
-	MEMTYPE_EXACT_MATCH	= 0,
-	MEMTYPE_END_MATCH	= 1
-};
-
-static struct memtype *memtype_match(u64 start, u64 end, int match_type)
+static struct memtype *memtype_match(u64 start, u64 end)
 {
 	struct memtype *entry_match;
 
 	entry_match = interval_iter_first(&memtype_rbroot, start, end-1);
 
 	while (entry_match != NULL && entry_match->start < end) {
-		if ((match_type == MEMTYPE_EXACT_MATCH) &&
-		    (entry_match->start == start) && (entry_match->end == end))
-			return entry_match;
-
-		if ((match_type == MEMTYPE_END_MATCH) &&
-		    (entry_match->start < start) && (entry_match->end == end))
+		if (entry_match->start == start && entry_match->end == end)
 			return entry_match;
-
 		entry_match = interval_iter_next(entry_match, start, end-1);
 	}
 
@@ -132,32 +121,11 @@ struct memtype *memtype_erase(u64 start, u64 end)
 {
 	struct memtype *entry_old;
 
-	/*
-	 * Since the memtype_rbroot tree allows overlapping ranges,
-	 * memtype_erase() checks with EXACT_MATCH first, i.e. free
-	 * a whole node for the munmap case.  If no such entry is found,
-	 * it then checks with END_MATCH, i.e. shrink the size of a node
-	 * from the end for the mremap case.
-	 */
-	entry_old = memtype_match(start, end, MEMTYPE_EXACT_MATCH);
-	if (!entry_old) {
-		entry_old = memtype_match(start, end, MEMTYPE_END_MATCH);
-		if (!entry_old)
-			return ERR_PTR(-EINVAL);
-	}
-
-	if (entry_old->start == start) {
-		/* munmap: erase this node */
-		interval_remove(entry_old, &memtype_rbroot);
-	} else {
-		/* mremap: update the end value of this node */
-		interval_remove(entry_old, &memtype_rbroot);
-		entry_old->end = start;
-		interval_insert(entry_old, &memtype_rbroot);
-
-		return NULL;
-	}
+	entry_old = memtype_match(start, end);
+	if (!entry_old)
+		return ERR_PTR(-EINVAL);
 
+	interval_remove(entry_old, &memtype_rbroot);
 	return entry_old;
 }
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v1 10/11] drm/i915: track_pfn() -> "pfnmap tracking"
  2025-04-25  8:17 [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
                   ` (8 preceding siblings ...)
  2025-04-25  8:17 ` [PATCH v1 09/11] x86/mm/pat: remove MEMTYPE_*_MATCH David Hildenbrand
@ 2025-04-25  8:17 ` David Hildenbrand
  2025-04-28 20:23   ` Lorenzo Stoakes
  2025-04-25  8:17 ` [PATCH v1 11/11] mm/io-mapping: " David Hildenbrand
  2025-04-25  8:54 ` [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT Ingo Molnar
  11 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25  8:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
	David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu

track_pfn() does not exist; let's simply refer to it as "pfnmap
tracking".

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/gpu/drm/i915/i915_mm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_mm.c b/drivers/gpu/drm/i915/i915_mm.c
index 76e2801619f09..c33bd3d830699 100644
--- a/drivers/gpu/drm/i915/i915_mm.c
+++ b/drivers/gpu/drm/i915/i915_mm.c
@@ -100,7 +100,7 @@ int remap_io_mapping(struct vm_area_struct *vma,
 
 	GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS);
 
-	/* We rely on prevalidation of the io-mapping to skip track_pfn(). */
+	/* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
 	r.mm = vma->vm_mm;
 	r.pfn = pfn;
 	r.prot = __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) |
@@ -140,7 +140,7 @@ int remap_io_sg(struct vm_area_struct *vma,
 	};
 	int err;
 
-	/* We rely on prevalidation of the io-mapping to skip track_pfn(). */
+	/* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
 	GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS);
 
 	while (offset >= r.sgt.max >> PAGE_SHIFT) {
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v1 11/11] mm/io-mapping: track_pfn() -> "pfnmap tracking"
  2025-04-25  8:17 [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
                   ` (9 preceding siblings ...)
  2025-04-25  8:17 ` [PATCH v1 10/11] drm/i915: track_pfn() -> "pfnmap tracking" David Hildenbrand
@ 2025-04-25  8:17 ` David Hildenbrand
  2025-04-28 16:06   ` Lorenzo Stoakes
  2025-04-25  8:54 ` [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT Ingo Molnar
  11 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25  8:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
	David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu

track_pfn() does not exist; let's simply refer to it as "pfnmap
tracking".

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/io-mapping.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/io-mapping.c b/mm/io-mapping.c
index 01b3627999304..7266441ad0834 100644
--- a/mm/io-mapping.c
+++ b/mm/io-mapping.c
@@ -21,7 +21,7 @@ int io_mapping_map_user(struct io_mapping *iomap, struct vm_area_struct *vma,
 	if (WARN_ON_ONCE((vma->vm_flags & expected_flags) != expected_flags))
 		return -EINVAL;
 
-	/* We rely on prevalidation of the io-mapping to skip track_pfn(). */
+	/* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
 	return remap_pfn_range_notrack(vma, addr, pfn, size,
 		__pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) |
 			 (pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK)));
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT
  2025-04-25  8:17 [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
                   ` (10 preceding siblings ...)
  2025-04-25  8:17 ` [PATCH v1 11/11] mm/io-mapping: " David Hildenbrand
@ 2025-04-25  8:54 ` Ingo Molnar
  2025-04-25  9:27   ` David Hildenbrand
  11 siblings, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2025-04-25  8:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu


* David Hildenbrand <david@redhat.com> wrote:

> On top of mm-unstable.
> 
> VM_PAT annoyed me too much and wasted too much of my time, let's clean
> PAT handling up and remove VM_PAT.
> 
> This should sort out various issues with VM_PAT we discovered recently,
> and will hopefully make the whole code more stable and easier to maintain.
> 
> In essence: we stop letting PAT mode mess with VMAs and instead lift
> what to track/untrack to the MM core. We remember per VMA which pfn range
> we tracked in a new struct we attach to a VMA (we have space without
> exceeding 192 bytes), use a kref to share it among VMAs during
> split/mremap/fork, and automatically untrack once the kref drops to 0.

Yay!

The extra pointer in vm_area_struct is a small price to pay IMHO.

> This implies that we'll keep tracking a full pfn range even after partially
> unmapping it, until fully unmapping it; but as that case was mostly broken
> before, this at least makes it work in a way that is least intrusive to
> VMA handling.
> 
> Shrinking with mremap() used to work in a hacky way, now we'll similarly
> keep the original pfn range tacked even after this form of partial unmap.
> Does anybody care about that? Unlikely. If we run into issues, we could
> likely handled that (adjust the tracking) when our kref drops to 1 while
> freeing a VMA. But it adds more complexity, so avoid that for now.
> 
> Briefly tested
> 
> There will be some clash with [1], but nothing that cannot be sorted out
> easily by moving the functions added to kernel/fork.c to wherever the vma
> bits will live.
> 
> Briefly tested with some basic /dev/mem test I crafted. I want to convert
> them to selftests, but that might or might not require a bit of
> more work (e.g., /dev/mem accessibility).

So for the x86 bits, once it passes review by the fine MM folks:

  Acked-by: Ingo Molnar <mingo@kernel.org>

And I suppose this rewrite will be carried in -mm?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT
  2025-04-25  8:54 ` [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT Ingo Molnar
@ 2025-04-25  9:27   ` David Hildenbrand
  0 siblings, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25  9:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu

>> There will be some clash with [1], but nothing that cannot be sorted out
>> easily by moving the functions added to kernel/fork.c to wherever the vma
>> bits will live.
>>
>> Briefly tested with some basic /dev/mem test I crafted. I want to convert
>> them to selftests, but that might or might not require a bit of
>> more work (e.g., /dev/mem accessibility).
> 
> So for the x86 bits, once it passes review by the fine MM folks:
> 
>    Acked-by: Ingo Molnar <mingo@kernel.org>
> 

Thanks!

> And I suppose this rewrite will be carried in -mm?

Yes, that will make conflicts with Lorenzo's work easier to resolve (in 
whatever order this ends up going in). I suspect there are not many 
PAT-related things on the horizon.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
  2025-04-25  8:17 ` [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot() David Hildenbrand
@ 2025-04-25 19:31   ` Peter Xu
  2025-04-25 19:48     ` David Hildenbrand
  2025-04-25 19:56     ` David Hildenbrand
  0 siblings, 2 replies; 59+ messages in thread
From: Peter Xu @ 2025-04-25 19:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On Fri, Apr 25, 2025 at 10:17:06AM +0200, David Hildenbrand wrote:
> ... by factoring it out from track_pfn_remap().
> 
> For PMDs/PUDs, actually check the full range, and trigger a fallback
> if we run into this "different memory types / cachemodes" scenario.

The current patch looks like it still passes PAGE_SIZE into the new helper
at all track_pfn_insert() call sites, so it seems this comment does not
100% match the code?  Or I may have misread somewhere.

Maybe it's still easier to keep the single-pfn lookup to never fail..  more
below.

> 
> Add some documentation.
> 
> Will checking each page result in undesired overhead? We'll have to
> learn. Not checking each page looks wrong, though. Maybe we could
> optimize the lookup internally.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/x86/mm/pat/memtype.c | 24 ++++++++----------------
>  include/linux/pgtable.h   | 28 ++++++++++++++++++++--------
>  mm/huge_memory.c          |  7 +++++--
>  mm/memory.c               |  4 ++--
>  4 files changed, 35 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index edec5859651d6..193e33251b18f 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -1031,7 +1031,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>  		    unsigned long pfn, unsigned long addr, unsigned long size)
>  {
>  	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> -	enum page_cache_mode pcm;
>  
>  	/* reserve the whole chunk starting from paddr */
>  	if (!vma || (addr == vma->vm_start
> @@ -1044,13 +1043,17 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>  		return ret;
>  	}
>  
> +	return pfnmap_sanitize_pgprot(pfn, size, prot);
> +}
> +
> +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size, pgprot_t *prot)
> +{
> +	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> +	enum page_cache_mode pcm;
> +
>  	if (!pat_enabled())
>  		return 0;
>  
> -	/*
> -	 * For anything smaller than the vma size we set prot based on the
> -	 * lookup.
> -	 */
>  	pcm = lookup_memtype(paddr);
>  
>  	/* Check memtype for the remaining pages */
> @@ -1065,17 +1068,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>  	return 0;
>  }
>  
> -void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
> -{
> -	enum page_cache_mode pcm;
> -
> -	if (!pat_enabled())
> -		return;
> -
> -	pcm = lookup_memtype(pfn_t_to_phys(pfn));
> -	pgprot_set_cachemode(prot, pcm);
> -}
> -
>  /*
>   * untrack_pfn is called while unmapping a pfnmap for a region.
>   * untrack can be called for a specific region indicated by pfn and size or
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c921..91aadfe2515a5 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1500,13 +1500,10 @@ static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>  	return 0;
>  }
>  
> -/*
> - * track_pfn_insert is called when a _new_ single pfn is established
> - * by vmf_insert_pfn().
> - */
> -static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> -				    pfn_t pfn)
> +static inline int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
> +		pgprot_t *prot)
>  {
> +	return 0;
>  }
>  
>  /*
> @@ -1556,8 +1553,23 @@ static inline void untrack_pfn_clear(struct vm_area_struct *vma)
>  extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>  			   unsigned long pfn, unsigned long addr,
>  			   unsigned long size);
> -extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> -			     pfn_t pfn);
> +
> +/**
> + * pfnmap_sanitize_pgprot - sanitize the pgprot for a pfn range

Nit: s/sanitize/update|setup|.../?

But maybe you have good reason to use sanitize.  No strong opinions.

> + * @pfn: the start of the pfn range
> + * @size: the size of the pfn range
> + * @prot: the pgprot to sanitize
> + *
> + * Sanitize the given pgprot for a pfn range, for example, adjusting the
> + * cachemode.
> + *
> + * This function cannot fail for a single page, but can fail for multiple
> + * pages.
> + *
> + * Returns 0 on success and -EINVAL on error.
> + */
> +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
> +		pgprot_t *prot);
>  extern int track_pfn_copy(struct vm_area_struct *dst_vma,
>  		struct vm_area_struct *src_vma, unsigned long *pfn);
>  extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fdcf0a6049b9f..b8ae5e1493315 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>  			return VM_FAULT_OOM;
>  	}
>  
> -	track_pfn_insert(vma, &pgprot, pfn);
> +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
> +		return VM_FAULT_FALLBACK;

Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
trigger, though.

Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
a win already in all cases.

> +
>  	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>  	error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write,
>  			pgtable);
> @@ -1577,7 +1579,8 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
>  	if (addr < vma->vm_start || addr >= vma->vm_end)
>  		return VM_FAULT_SIGBUS;
>  
> -	track_pfn_insert(vma, &pgprot, pfn);
> +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
> +		return VM_FAULT_FALLBACK;
>  
>  	ptl = pud_lock(vma->vm_mm, vmf->pud);
>  	insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
> diff --git a/mm/memory.c b/mm/memory.c
> index 424420349bd3c..c737a8625866a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2563,7 +2563,7 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
>  	if (!pfn_modify_allowed(pfn, pgprot))
>  		return VM_FAULT_SIGBUS;
>  
> -	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
> +	pfnmap_sanitize_pgprot(pfn, PAGE_SIZE, &pgprot);
>  
>  	return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
>  			false);
> @@ -2626,7 +2626,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
>  	if (addr < vma->vm_start || addr >= vma->vm_end)
>  		return VM_FAULT_SIGBUS;
>  
> -	track_pfn_insert(vma, &pgprot, pfn);
> +	pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot);
>  
>  	if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
>  		return VM_FAULT_SIGBUS;
> -- 
> 2.49.0
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
  2025-04-25 19:31   ` Peter Xu
@ 2025-04-25 19:48     ` David Hildenbrand
  2025-04-25 23:59       ` Peter Xu
  2025-04-25 19:56     ` David Hildenbrand
  1 sibling, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25 19:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On 25.04.25 21:31, Peter Xu wrote:
> On Fri, Apr 25, 2025 at 10:17:06AM +0200, David Hildenbrand wrote:
>> ... by factoring it out from track_pfn_remap().
>>
>> For PMDs/PUDs, actually check the full range, and trigger a fallback
>> if we run into this "different memory types / cachemodes" scenario.
> 
> The current patch looks like it still passes PAGE_SIZE into the new helper
> at all track_pfn_insert() call sites, so it seems this comment does not
> 100% match the code?  Or I may have misread somewhere.

No, you're right, while reshuffling the patches I forgot to add the 
actual PMD/PUD size.

> 
> Maybe it's still easier to keep the single-pfn lookup to never fail..  more
> below.
> 

[...]

>>   /*
>> @@ -1556,8 +1553,23 @@ static inline void untrack_pfn_clear(struct vm_area_struct *vma)
>>   extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>>   			   unsigned long pfn, unsigned long addr,
>>   			   unsigned long size);
>> -extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
>> -			     pfn_t pfn);
>> +
>> +/**
>> + * pfnmap_sanitize_pgprot - sanitize the pgprot for a pfn range
> 
> Nit: s/sanitize/update|setup|.../?
> 
> But maybe you have good reason to use sanitize.  No strong opinions.

What it does on PAT (only implementation so far ...) is looking up the 
memory type to select the caching mode that can be used.

"sanitize" was IMHO a good fit, because we must make sure that we don't 
use the wrong caching mode.

update/setup/... don't make that quite clear. Any other suggestions?

> 
>> + * @pfn: the start of the pfn range
>> + * @size: the size of the pfn range
>> + * @prot: the pgprot to sanitize
>> + *
>> + * Sanitize the given pgprot for a pfn range, for example, adjusting the
>> + * cachemode.
>> + *
>> + * This function cannot fail for a single page, but can fail for multiple
>> + * pages.
>> + *
>> + * Returns 0 on success and -EINVAL on error.
>> + */
>> +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
>> +		pgprot_t *prot);
>>   extern int track_pfn_copy(struct vm_area_struct *dst_vma,
>>   		struct vm_area_struct *src_vma, unsigned long *pfn);
>>   extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index fdcf0a6049b9f..b8ae5e1493315 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>>   			return VM_FAULT_OOM;
>>   	}
>>   
>> -	track_pfn_insert(vma, &pgprot, pfn);
>> +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
>> +		return VM_FAULT_FALLBACK;
> 
> Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
> trigger, though.
> 
> Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
> replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
> a win already in all cases.

It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. 
That's certainly helpful for the single-page case.
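
Completely untested, but roughly along the lines you suggested:

	static inline void pfnmap_sanitize_pgprot_pfn(pgprot_t *prot,
			unsigned long pfn)
	{
		/* A single page can always be looked up; this cannot fail. */
		WARN_ON_ONCE(pfnmap_sanitize_pgprot(pfn, PAGE_SIZE, prot));
	}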

Regarding never failing here: we should check the whole range. We have 
to make sure that none of the pages has a memory type / caching mode 
that is incompatible with what we set up.


Thanks a bunch for the review!
-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
  2025-04-25 19:31   ` Peter Xu
  2025-04-25 19:48     ` David Hildenbrand
@ 2025-04-25 19:56     ` David Hildenbrand
  1 sibling, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25 19:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

>>   
>> -	track_pfn_insert(vma, &pgprot, pfn);
>> +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
>> +		return VM_FAULT_FALLBACK;
> 
> Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
> trigger, though.

Missed that comment. I can document that pgprot will only be touched if 
the function succeeds.
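
E.g., extending the kerneldoc along these lines:

 * This function cannot fail for a single page, but can fail for multiple
 * pages. On error, @prot is left untouched.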

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: convert to pfnmap_track() + pfnmap_untrack()
  2025-04-25  8:17 ` [PATCH v1 04/11] mm/memremap: convert to pfnmap_track() + pfnmap_untrack() David Hildenbrand
@ 2025-04-25 20:00   ` Peter Xu
  2025-04-25 20:14     ` David Hildenbrand
                       ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Peter Xu @ 2025-04-25 20:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On Fri, Apr 25, 2025 at 10:17:08AM +0200, David Hildenbrand wrote:
> Let's use the new, cleaner interface.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memremap.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 2aebc1b192da9..c417c843e9b1f 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -130,7 +130,7 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
>  	}
>  	mem_hotplug_done();
>  
> -	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
> +	pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
>  	pgmap_array_delete(range);
>  }
>  
> @@ -211,8 +211,8 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>  	if (nid < 0)
>  		nid = numa_mem_id();
>  
> -	error = track_pfn_remap(NULL, &params->pgprot, PHYS_PFN(range->start), 0,
> -			range_len(range));
> +	error = pfnmap_track(PHYS_PFN(range->start), range_len(range),
> +			     &params->pgprot);
>  	if (error)
>  		goto err_pfn_remap;
>  
> @@ -277,7 +277,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>  	if (!is_private)
>  		kasan_remove_zero_shadow(__va(range->start), range_len(range));
>  err_kasan:
> -	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
> +	pfnmap_untrack(PHYS_PFN(range->start), range_len(range));

Not a huge deal, but maybe we could merge this and the previous patch?  It
might be easier to reference the impl when reading the call-site changes.

>  err_pfn_remap:
>  	pgmap_array_delete(range);
>  	return error;
> -- 
> 2.49.0
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: convert to pfnmap_track() + pfnmap_untrack()
  2025-04-25 20:00   ` Peter Xu
@ 2025-04-25 20:14     ` David Hildenbrand
  2025-04-28 16:54     ` Lorenzo Stoakes
  2025-04-28 17:07     ` Lorenzo Stoakes
  2 siblings, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25 20:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On 25.04.25 22:00, Peter Xu wrote:
> On Fri, Apr 25, 2025 at 10:17:08AM +0200, David Hildenbrand wrote:
>> Let's use the new, cleaner interface.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/memremap.c | 8 ++++----
>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 2aebc1b192da9..c417c843e9b1f 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -130,7 +130,7 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
>>   	}
>>   	mem_hotplug_done();
>>   
>> -	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
>> +	pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
>>   	pgmap_array_delete(range);
>>   }
>>   
>> @@ -211,8 +211,8 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>>   	if (nid < 0)
>>   		nid = numa_mem_id();
>>   
>> -	error = track_pfn_remap(NULL, &params->pgprot, PHYS_PFN(range->start), 0,
>> -			range_len(range));
>> +	error = pfnmap_track(PHYS_PFN(range->start), range_len(range),
>> +			     &params->pgprot);
>>   	if (error)
>>   		goto err_pfn_remap;
>>   
>> @@ -277,7 +277,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>>   	if (!is_private)
>>   		kasan_remove_zero_shadow(__va(range->start), range_len(range));
>>   err_kasan:
>> -	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
>> +	pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
> 
> Not a huge deal, but maybe we could merge this and the previous patch?  It
> might be easier to reference the impl when reading the call-site changes.

Yes, I can do that. The important part to me is to split #5 off, to keep 
that patch somewhat reasonable in size.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-25  8:17 ` [PATCH v1 05/11] mm: convert VM_PFNMAP tracking " David Hildenbrand
@ 2025-04-25 20:23   ` Peter Xu
  2025-04-25 20:36     ` David Hildenbrand
  2025-04-28 19:38   ` Lorenzo Stoakes
  2025-04-28 20:10   ` Lorenzo Stoakes
  2 siblings, 1 reply; 59+ messages in thread
From: Peter Xu @ 2025-04-25 20:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
> Let's use our new interface. In remap_pfn_range(), we'll now decide
> whether we have to track (full VMA covered) or only sanitize the pgprot
> (partial VMA covered).
> 
> Remember what we have to untrack by linking it from the VMA. When
> duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
> to anon VMA names, and use a kref to share the tracking.
> 
> Once the last VMA un-refs our tracking data, we'll do the untracking,
> which simplifies things a lot and should sort our various issues we saw
> recently, for example, when partially unmapping/zapping a tracked VMA.
> 
> This change implies that we'll keep tracking the original PFN range even
> after splitting + partially unmapping it: not too bad, because it was
> not working reliably before. The only thing that kind-of worked before
> was shrinking such a mapping using mremap(): we managed to adjust the
> reservation in a hacky way, now we won't adjust the reservation but
> leave it around until all involved VMAs are gone.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/mm_inline.h |  2 +
>  include/linux/mm_types.h  | 11 ++++++
>  kernel/fork.c             | 54 ++++++++++++++++++++++++--
>  mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
>  mm/mremap.c               |  4 --
>  5 files changed, 128 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index f9157a0c42a5c..89b518ff097e6 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>  
>  #endif  /* CONFIG_ANON_VMA_NAME */
>  
> +void pfnmap_track_ctx_release(struct kref *ref);
> +
>  static inline void init_tlb_flush_pending(struct mm_struct *mm)
>  {
>  	atomic_set(&mm->tlb_flush_pending, 0);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 56d07edd01f91..91124761cfda8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -764,6 +764,14 @@ struct vma_numab_state {
>  	int prev_scan_seq;
>  };
>  
> +#ifdef __HAVE_PFNMAP_TRACKING
> +struct pfnmap_track_ctx {
> +	struct kref kref;
> +	unsigned long pfn;
> +	unsigned long size;
> +};
> +#endif
> +
>  /*
>   * This struct describes a virtual memory area. There is one of these
>   * per VM-area/task. A VM area is any part of the process virtual memory
> @@ -877,6 +885,9 @@ struct vm_area_struct {
>  	struct anon_vma_name *anon_name;
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef __HAVE_PFNMAP_TRACKING
> +	struct pfnmap_track_ctx *pfnmap_track_ctx;
> +#endif

So this was originally the small concern (or is it small?) that this will
grow every vma on x86, am I right?

After all, pfnmap vmas are a minority, so I was wondering whether we could
work it out without extending the vma struct.

I had a quick thought quite a while ago, but never tried it out (it almost
went off-track since vfio switched away from remap_pfn_range..), which is to
have x86 maintain its own vma <-> pfn tracking mapping using a global
structure.  After all, the memtype code did it already with the
memtype_rbroot, so I was thinking vma info could be remembered as well, so
as to get rid of get_pat_info() too.

Maybe it also needs the 2nd layer like what you did with the track ctx, but
the tree maintains the mapping instead of adding the ctx pointer into vma.

Maybe it could work with squashing the two layers (or say, extending
memtype rbtree), but maybe not..

It could make looking up the pfn slightly slower than a
vma->pfnmap_track_ctx deref while holding a vma ref, but I assume that's ok
considering that track/untrack should be a slow path for pfnmaps, and there
shouldn't be a huge number of pfnmaps.

I didn't think further, but if that worked it would definitely avoid the
additional field on x86 vmas.  I'm curious whether you explored that
direction, or maybe it's a known decision that the 8 bytes aren't a concern.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-25 20:23   ` Peter Xu
@ 2025-04-25 20:36     ` David Hildenbrand
  2025-04-28 16:08       ` Peter Xu
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-25 20:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On 25.04.25 22:23, Peter Xu wrote:
> On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
>> Let's use our new interface. In remap_pfn_range(), we'll now decide
>> whether we have to track (full VMA covered) or only sanitize the pgprot
>> (partial VMA covered).
>>
>> Remember what we have to untrack by linking it from the VMA. When
>> duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
>> to anon VMA names, and use a kref to share the tracking.
>>
>> Once the last VMA un-refs our tracking data, we'll do the untracking,
> > which simplifies things a lot and should sort out the various issues we saw
>> recently, for example, when partially unmapping/zapping a tracked VMA.
>>
>> This change implies that we'll keep tracking the original PFN range even
>> after splitting + partially unmapping it: not too bad, because it was
>> not working reliably before. The only thing that kind-of worked before
>> was shrinking such a mapping using mremap(): we managed to adjust the
>> reservation in a hacky way, now we won't adjust the reservation but
>> leave it around until all involved VMAs are gone.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   include/linux/mm_inline.h |  2 +
>>   include/linux/mm_types.h  | 11 ++++++
>>   kernel/fork.c             | 54 ++++++++++++++++++++++++--
>>   mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
>>   mm/mremap.c               |  4 --
>>   5 files changed, 128 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> index f9157a0c42a5c..89b518ff097e6 100644
>> --- a/include/linux/mm_inline.h
>> +++ b/include/linux/mm_inline.h
>> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>>   
>>   #endif  /* CONFIG_ANON_VMA_NAME */
>>   
>> +void pfnmap_track_ctx_release(struct kref *ref);
>> +
>>   static inline void init_tlb_flush_pending(struct mm_struct *mm)
>>   {
>>   	atomic_set(&mm->tlb_flush_pending, 0);
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 56d07edd01f91..91124761cfda8 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -764,6 +764,14 @@ struct vma_numab_state {
>>   	int prev_scan_seq;
>>   };
>>   
>> +#ifdef __HAVE_PFNMAP_TRACKING
>> +struct pfnmap_track_ctx {
>> +	struct kref kref;
>> +	unsigned long pfn;
>> +	unsigned long size;
>> +};
>> +#endif
>> +
>>   /*
>>    * This struct describes a virtual memory area. There is one of these
>>    * per VM-area/task. A VM area is any part of the process virtual memory
>> @@ -877,6 +885,9 @@ struct vm_area_struct {
>>   	struct anon_vma_name *anon_name;
>>   #endif
>>   	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>> +#ifdef __HAVE_PFNMAP_TRACKING
>> +	struct pfnmap_track_ctx *pfnmap_track_ctx;
>> +#endif
> 
> So this was originally the small concern (or is it small?) that this will
> grow every vma on x86, am I right?

Yeah, and last time I looked into this, it would have grown it such that it would
require a bigger slab. Right now:

Before this change:

struct vm_area_struct {
	union {
		struct {
			long unsigned int vm_start;      /*     0     8 */
			long unsigned int vm_end;        /*     8     8 */
		};                                       /*     0    16 */
		freeptr_t          vm_freeptr;           /*     0     8 */
	};                                               /*     0    16 */
	struct mm_struct *         vm_mm;                /*    16     8 */
	pgprot_t                   vm_page_prot;         /*    24     8 */
	union {
		const vm_flags_t   vm_flags;             /*    32     8 */
		vm_flags_t         __vm_flags;           /*    32     8 */
	};                                               /*    32     8 */
	unsigned int               vm_lock_seq;          /*    40     4 */

	/* XXX 4 bytes hole, try to pack */

	struct list_head           anon_vma_chain;       /*    48    16 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct anon_vma *          anon_vma;             /*    64     8 */
	const struct vm_operations_struct  * vm_ops;     /*    72     8 */
	long unsigned int          vm_pgoff;             /*    80     8 */
	struct file *              vm_file;              /*    88     8 */
	void *                     vm_private_data;      /*    96     8 */
	atomic_long_t              swap_readahead_info;  /*   104     8 */
	struct mempolicy *         vm_policy;            /*   112     8 */
	struct vma_numab_state *   numab_state;          /*   120     8 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	refcount_t                 vm_refcnt __attribute__((__aligned__(64))); /*   128     4 */

	/* XXX 4 bytes hole, try to pack */

	struct {
		struct rb_node     rb __attribute__((__aligned__(8))); /*   136    24 */
		long unsigned int  rb_subtree_last;      /*   160     8 */
	} __attribute__((__aligned__(8))) shared __attribute__((__aligned__(8)));        /*   136    32 */
	struct anon_vma_name *     anon_name;            /*   168     8 */
	struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */

	/* size: 192, cachelines: 3, members: 18 */
	/* sum members: 168, holes: 2, sum holes: 8 */
	/* padding: 16 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
} __attribute__((__aligned__(64)));

After this change:

struct vm_area_struct {
	union {
		struct {
			long unsigned int vm_start;      /*     0     8 */
			long unsigned int vm_end;        /*     8     8 */
		};                                       /*     0    16 */
		freeptr_t          vm_freeptr;           /*     0     8 */
	};                                               /*     0    16 */
	struct mm_struct *         vm_mm;                /*    16     8 */
	pgprot_t                   vm_page_prot;         /*    24     8 */
	union {
		const vm_flags_t   vm_flags;             /*    32     8 */
		vm_flags_t         __vm_flags;           /*    32     8 */
	};                                               /*    32     8 */
	unsigned int               vm_lock_seq;          /*    40     4 */

	/* XXX 4 bytes hole, try to pack */

	struct list_head           anon_vma_chain;       /*    48    16 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct anon_vma *          anon_vma;             /*    64     8 */
	const struct vm_operations_struct  * vm_ops;     /*    72     8 */
	long unsigned int          vm_pgoff;             /*    80     8 */
	struct file *              vm_file;              /*    88     8 */
	void *                     vm_private_data;      /*    96     8 */
	atomic_long_t              swap_readahead_info;  /*   104     8 */
	struct mempolicy *         vm_policy;            /*   112     8 */
	struct vma_numab_state *   numab_state;          /*   120     8 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	refcount_t                 vm_refcnt __attribute__((__aligned__(64))); /*   128     4 */

	/* XXX 4 bytes hole, try to pack */

	struct {
		struct rb_node     rb __attribute__((__aligned__(8))); /*   136    24 */
		long unsigned int  rb_subtree_last;      /*   160     8 */
	} __attribute__((__aligned__(8))) shared __attribute__((__aligned__(8)));        /*   136    32 */
	struct anon_vma_name *     anon_name;            /*   168     8 */
	struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
	struct pfnmap_track_ctx *  pfnmap_track_ctx;     /*   176     8 */

	/* size: 192, cachelines: 3, members: 19 */
	/* sum members: 176, holes: 2, sum holes: 8 */
	/* padding: 8 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
} __attribute__((__aligned__(64)));

Observe that we allocate 192 bytes with or without pfnmap_track_ctx. (IIRC,
slab sizes are ... 128, 192, 256, 512, ...)
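
(The layout dumps above are pahole output; something like "pahole -C
vm_area_struct vmlinux" on a build with debuginfo should reproduce them.)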

> 
> After all, pfnmap vmas are a minority, so I was wondering whether we could
> work it out without extending the vma struct.

Heh, similar to userfaultfd on most systems, or ones with a mempolicy, or
anon vma names, ... :)

But yeah, pfnmap is certainly a minority as well.

> 
> I had a quick thought quite a while ago, but never tried it out (it almost
> went off-track since vfio switched away from remap_pfn_range..), which is to
> have x86 maintain its own vma <-> pfn tracking mapping using a global
> structure.  After all, the memtype code did it already with the
> memtype_rbroot, so I was thinking vma info could be remembered as well, so
> as to get rid of get_pat_info() too.
> 
> Maybe it also needs the 2nd layer like what you did with the track ctx, but
> the tree maintains the mapping instead of adding the ctx pointer into vma.
> 
> Maybe it could work with squashing the two layers (or say, extending
> memtype rbtree), but maybe not..
> 
> It could make looking up the pfn slightly slower than a
> vma->pfnmap_track_ctx deref while holding a vma ref, but I assume that's ok
> considering that track/untrack should be a slow path for pfnmaps, and there
> shouldn't be a huge number of pfnmaps.
> 
> I didn't think further, but if that worked it would definitely avoid the
> additional field on x86 vmas.  I'm curious whether you explored that
> direction, or maybe it's a known decision that the 8 bytes aren't a concern.

When discussing this approach with Lorenzo, I raised that we could simply
map the VMA to that structure using an xarray.
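
Completely untested, and the names are made up, but something along these
lines:

	/* Global vma -> pfnmap_track_ctx mapping instead of a vma field. */
	static DEFINE_XARRAY(pfnmap_track_ctxs);

	static struct pfnmap_track_ctx *pfnmap_track_ctx_lookup(
			struct vm_area_struct *vma)
	{
		return xa_load(&pfnmap_track_ctxs, (unsigned long)vma);
	}

That would keep vm_area_struct untouched, at the cost of a global lookup.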

But then, if we're not effectively allocating any more space, it's probably
not worth adding more complexity right now.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
  2025-04-25 19:48     ` David Hildenbrand
@ 2025-04-25 23:59       ` Peter Xu
  2025-04-28 14:58         ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Xu @ 2025-04-25 23:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On Fri, Apr 25, 2025 at 09:48:50PM +0200, David Hildenbrand wrote:
> On 25.04.25 21:31, Peter Xu wrote:
> > On Fri, Apr 25, 2025 at 10:17:06AM +0200, David Hildenbrand wrote:
> > > ... by factoring it out from track_pfn_remap().
> > > 
> > > For PMDs/PUDs, actually check the full range, and trigger a fallback
> > > if we run into this "different memory types / cachemodes" scenario.
> > 
> > The current patch looks like it still passes PAGE_SIZE into the new helper
> > at all track_pfn_insert() call sites, so it seems this comment does not
> > 100% match the code?  Or I may have misread somewhere.
> 
> No, you're right, while reshuffling the patches I forgot to add the actual
> PMD/PUD size.
> 
> > 
> > Maybe it's still easier to keep the single-pfn lookup to never fail..  more
> > below.
> > 
> 
> [...]
> 
> > >   /*
> > > @@ -1556,8 +1553,23 @@ static inline void untrack_pfn_clear(struct vm_area_struct *vma)
> > >   extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> > >   			   unsigned long pfn, unsigned long addr,
> > >   			   unsigned long size);
> > > -extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> > > -			     pfn_t pfn);
> > > +
> > > +/**
> > > + * pfnmap_sanitize_pgprot - sanitize the pgprot for a pfn range
> > 
> > Nit: s/sanitize/update|setup|.../?
> > 
> > But maybe you have good reason to use sanitize.  No strong opinions.
> 
> What it does on PAT (only implementation so far ...) is looking up the
> memory type to select the caching mode that can be used.
> 
> "sanitize" was IMHO a good fit, because we must make sure that we don't use
> the wrong caching mode.
> 
> update/setup/... don't make that quite clear. Any other suggestions?

I'm very poor at naming.. :( So far anything seems slightly better than
sanitize to me, as the word "sanitize" is actually also used in memtype.c
for another purpose.. see sanitize_phys().

> 
> > 
> > > + * @pfn: the start of the pfn range
> > > + * @size: the size of the pfn range
> > > + * @prot: the pgprot to sanitize
> > > + *
> > > + * Sanitize the given pgprot for a pfn range, for example, adjusting the
> > > + * cachemode.
> > > + *
> > > + * This function cannot fail for a single page, but can fail for multiple
> > > + * pages.
> > > + *
> > > + * Returns 0 on success and -EINVAL on error.
> > > + */
> > > +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
> > > +		pgprot_t *prot);
> > >   extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> > >   		struct vm_area_struct *src_vma, unsigned long *pfn);
> > >   extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index fdcf0a6049b9f..b8ae5e1493315 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> > >   			return VM_FAULT_OOM;
> > >   	}
> > > -	track_pfn_insert(vma, &pgprot, pfn);
> > > +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
> > > +		return VM_FAULT_FALLBACK;
> > 
> > Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
> > trigger, though.
> > 
> > Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
> > replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
> > a win already in all cases.
> 
> It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. That's
> certainly helpful for the single-page case.
> 
> Regarding never failing here: we should check the whole range. We have to
> make sure that none of the pages has a memory type / caching mode that is
> incompatible with what we set up.

Would it happen in the real world?

IIUC per-vma registration needs to happen first, which checks for memtype
conflicts in the first place, or reserve_pfn_range() could already have
failed.

Here it's the fault path looking up the memtype, so I would expect it is
guaranteed that all pfns under the same vma follow the verified (and same)
memtype?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
  2025-04-25 23:59       ` Peter Xu
@ 2025-04-28 14:58         ` David Hildenbrand
  2025-04-28 16:21           ` Peter Xu
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-28 14:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato


>> What it does on PAT (only implementation so far ...) is looking up the
>> memory type to select the caching mode that can be used.
>>
>> "sanitize" was IMHO a good fit, because we must make sure that we don't use
>> the wrong caching mode.
>>
>> update/setup/... don't make that quite clear. Any other suggestions?
> 
> I'm very poor at naming.. :( So far anything seems slightly better than
> sanitize to me, as the word "sanitize" is actually also used in memtype.c
> for another purpose.. see sanitize_phys().

Sure, one can sanitize a lot of things. Here it's the cachemode/pgprot, 
in the other functions it's an address.

Likely we should just call it pfnmap_X_cachemode().

Set/update don't really fit for X in case pfnmap_X_cachemode() is a NOP.

pfnmap_setup_cachemode() ? Hm.

> 
>>
>>>
>>>> + * @pfn: the start of the pfn range
>>>> + * @size: the size of the pfn range
>>>> + * @prot: the pgprot to sanitize
>>>> + *
>>>> + * Sanitize the given pgprot for a pfn range, for example, adjusting the
>>>> + * cachemode.
>>>> + *
>>>> + * This function cannot fail for a single page, but can fail for multiple
>>>> + * pages.
>>>> + *
>>>> + * Returns 0 on success and -EINVAL on error.
>>>> + */
>>>> +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
>>>> +		pgprot_t *prot);
>>>>    extern int track_pfn_copy(struct vm_area_struct *dst_vma,
>>>>    		struct vm_area_struct *src_vma, unsigned long *pfn);
>>>>    extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index fdcf0a6049b9f..b8ae5e1493315 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>>>>    			return VM_FAULT_OOM;
>>>>    	}
>>>> -	track_pfn_insert(vma, &pgprot, pfn);
>>>> +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
>>>> +		return VM_FAULT_FALLBACK;
>>>
>>> Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
>>> trigger, though.
>>>
>>> Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
>>> replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
>>> a win already in all cases.
>>
>> It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. That's
>> certainly helpful for the single-page case.
>>
>> Regarding never failing here: we should check the whole range. We have to
>> make sure that none of the pages has a memory type / caching mode that is
>> incompatible with what we set up.
> 
> Would it happen in the real world?
> 
> IIUC per-vma registration needs to happen first, which checks for memtype
> conflicts in the first place, or reserve_pfn_range() could already have
> failed.
> 
> Here it's the fault path looking up the memtype, so I would expect it is
> guaranteed that all pfns under the same vma follow the verified (and same)
> memtype?

The whole point of track_pfn_insert() is that it is used when we *don't* 
use reserve_pfn_range()->track_pfn_remap(), no?

track_pfn_remap() would check the whole range that gets mapped, so 
a track_pfn_insert() user must similarly check the whole range that gets 
mapped.
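
Completely untested, but for the PMD case that means passing the actual
mapping size and, as you noted, freeing the preallocated pgtable on the
fallback path:

	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PMD_SIZE, &pgprot)) {
		if (pgtable)
			pte_free(vma->vm_mm, pgtable);
		return VM_FAULT_FALLBACK;
	}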

Note that even track_pfn_insert() is already pretty clear on the 
intended usage: "called when a _new_ single pfn is established"

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 11/11] mm/io-mapping: track_pfn() -> "pfnmap tracking"
  2025-04-25  8:17 ` [PATCH v1 11/11] mm/io-mapping: " David Hildenbrand
@ 2025-04-28 16:06   ` Lorenzo Stoakes
  2025-04-28 16:14     ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 16:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu

On Fri, Apr 25, 2025 at 10:17:15AM +0200, David Hildenbrand wrote:
> track_pfn() does not exist, let's simply refer to it as "pfnmap
> tracking".
>
> Signed-off-by: David Hildenbrand <david@redhat.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/io-mapping.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/io-mapping.c b/mm/io-mapping.c
> index 01b3627999304..7266441ad0834 100644
> --- a/mm/io-mapping.c
> +++ b/mm/io-mapping.c
> @@ -21,7 +21,7 @@ int io_mapping_map_user(struct io_mapping *iomap, struct vm_area_struct *vma,
>  	if (WARN_ON_ONCE((vma->vm_flags & expected_flags) != expected_flags))
>  		return -EINVAL;
>
> -	/* We rely on prevalidation of the io-mapping to skip track_pfn(). */
> +	/* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
>  	return remap_pfn_range_notrack(vma, addr, pfn, size,
>  		__pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) |
>  			 (pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK)));
> --
> 2.49.0
>

However, this doesn't apply after commit b8d8f1830bab ("mm/io-mapping:
precompute remap protection flags for clarity"), so it will need a rebase :)
It seems this was cleaned up to separate the __pgprot() bit from the
remap_pfn_range_notrack() call.

Note of course this commit hash is from mm-new, so it is quite changeable... :)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-25 20:36     ` David Hildenbrand
@ 2025-04-28 16:08       ` Peter Xu
  2025-04-28 16:16         ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Xu @ 2025-04-28 16:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On Fri, Apr 25, 2025 at 10:36:55PM +0200, David Hildenbrand wrote:
> On 25.04.25 22:23, Peter Xu wrote:
> > On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
> > > Let's use our new interface. In remap_pfn_range(), we'll now decide
> > > whether we have to track (full VMA covered) or only sanitize the pgprot
> > > (partial VMA covered).
> > > 
> > > Remember what we have to untrack by linking it from the VMA. When
> > > duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
> > > to anon VMA names, and use a kref to share the tracking.
> > > 
> > > Once the last VMA un-refs our tracking data, we'll do the untracking,
> > > which simplifies things a lot and should sort out the various issues we saw
> > > recently, for example, when partially unmapping/zapping a tracked VMA.
> > > 
> > > This change implies that we'll keep tracking the original PFN range even
> > > after splitting + partially unmapping it: not too bad, because it was
> > > not working reliably before. The only thing that kind-of worked before
> > > was shrinking such a mapping using mremap(): we managed to adjust the
> > > reservation in a hacky way, now we won't adjust the reservation but
> > > leave it around until all involved VMAs are gone.
> > > 
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > ---
> > >   include/linux/mm_inline.h |  2 +
> > >   include/linux/mm_types.h  | 11 ++++++
> > >   kernel/fork.c             | 54 ++++++++++++++++++++++++--
> > >   mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
> > >   mm/mremap.c               |  4 --
> > >   5 files changed, 128 insertions(+), 24 deletions(-)
> > > 
> > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > > index f9157a0c42a5c..89b518ff097e6 100644
> > > --- a/include/linux/mm_inline.h
> > > +++ b/include/linux/mm_inline.h
> > > @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
> > >   #endif  /* CONFIG_ANON_VMA_NAME */
> > > +void pfnmap_track_ctx_release(struct kref *ref);
> > > +
> > >   static inline void init_tlb_flush_pending(struct mm_struct *mm)
> > >   {
> > >   	atomic_set(&mm->tlb_flush_pending, 0);
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 56d07edd01f91..91124761cfda8 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -764,6 +764,14 @@ struct vma_numab_state {
> > >   	int prev_scan_seq;
> > >   };
> > > +#ifdef __HAVE_PFNMAP_TRACKING
> > > +struct pfnmap_track_ctx {
> > > +	struct kref kref;
> > > +	unsigned long pfn;
> > > +	unsigned long size;
> > > +};
> > > +#endif
> > > +
> > >   /*
> > >    * This struct describes a virtual memory area. There is one of these
> > >    * per VM-area/task. A VM area is any part of the process virtual memory
> > > @@ -877,6 +885,9 @@ struct vm_area_struct {
> > >   	struct anon_vma_name *anon_name;
> > >   #endif
> > >   	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > > +#ifdef __HAVE_PFNMAP_TRACKING
> > > +	struct pfnmap_track_ctx *pfnmap_track_ctx;
> > > +#endif
> > 
> > So this was originally the small concern (or is it small?) that this will
> > grow every vma on x86, am I right?
> 
> Yeah, and last time I looked into this, it would have grown it such that it would
> require a bigger slab. Right now:

Probably due to the config you have.  E.g., when I look at mine it's
much bigger, already consuming 256B, but that's because I enabled more
things (userfaultfd, lockdep, etc.).

> 
> Before this change:
> 
> struct vm_area_struct {
> 	union {
> 		struct {
> 			long unsigned int vm_start;      /*     0     8 */
> 			long unsigned int vm_end;        /*     8     8 */
> 		};                                       /*     0    16 */
> 		freeptr_t          vm_freeptr;           /*     0     8 */
> 	};                                               /*     0    16 */
> 	struct mm_struct *         vm_mm;                /*    16     8 */
> 	pgprot_t                   vm_page_prot;         /*    24     8 */
> 	union {
> 		const vm_flags_t   vm_flags;             /*    32     8 */
> 		vm_flags_t         __vm_flags;           /*    32     8 */
> 	};                                               /*    32     8 */
> 	unsigned int               vm_lock_seq;          /*    40     4 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	struct list_head           anon_vma_chain;       /*    48    16 */
> 	/* --- cacheline 1 boundary (64 bytes) --- */
> 	struct anon_vma *          anon_vma;             /*    64     8 */
> 	const struct vm_operations_struct  * vm_ops;     /*    72     8 */
> 	long unsigned int          vm_pgoff;             /*    80     8 */
> 	struct file *              vm_file;              /*    88     8 */
> 	void *                     vm_private_data;      /*    96     8 */
> 	atomic_long_t              swap_readahead_info;  /*   104     8 */
> 	struct mempolicy *         vm_policy;            /*   112     8 */
> 	struct vma_numab_state *   numab_state;          /*   120     8 */
> 	/* --- cacheline 2 boundary (128 bytes) --- */
> 	refcount_t                 vm_refcnt __attribute__((__aligned__(64))); /*   128     4 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	struct {
> 		struct rb_node     rb __attribute__((__aligned__(8))); /*   136    24 */
> 		long unsigned int  rb_subtree_last;      /*   160     8 */
> 	} __attribute__((__aligned__(8))) shared __attribute__((__aligned__(8)));        /*   136    32 */
> 	struct anon_vma_name *     anon_name;            /*   168     8 */
> 	struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
> 
> 	/* size: 192, cachelines: 3, members: 18 */
> 	/* sum members: 168, holes: 2, sum holes: 8 */
> 	/* padding: 16 */
> 	/* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
> } __attribute__((__aligned__(64)));
> 
> After this change:
> 
> struct vm_area_struct {
> 	union {
> 		struct {
> 			long unsigned int vm_start;      /*     0     8 */
> 			long unsigned int vm_end;        /*     8     8 */
> 		};                                       /*     0    16 */
> 		freeptr_t          vm_freeptr;           /*     0     8 */
> 	};                                               /*     0    16 */
> 	struct mm_struct *         vm_mm;                /*    16     8 */
> 	pgprot_t                   vm_page_prot;         /*    24     8 */
> 	union {
> 		const vm_flags_t   vm_flags;             /*    32     8 */
> 		vm_flags_t         __vm_flags;           /*    32     8 */
> 	};                                               /*    32     8 */
> 	unsigned int               vm_lock_seq;          /*    40     4 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	struct list_head           anon_vma_chain;       /*    48    16 */
> 	/* --- cacheline 1 boundary (64 bytes) --- */
> 	struct anon_vma *          anon_vma;             /*    64     8 */
> 	const struct vm_operations_struct  * vm_ops;     /*    72     8 */
> 	long unsigned int          vm_pgoff;             /*    80     8 */
> 	struct file *              vm_file;              /*    88     8 */
> 	void *                     vm_private_data;      /*    96     8 */
> 	atomic_long_t              swap_readahead_info;  /*   104     8 */
> 	struct mempolicy *         vm_policy;            /*   112     8 */
> 	struct vma_numab_state *   numab_state;          /*   120     8 */
> 	/* --- cacheline 2 boundary (128 bytes) --- */
> 	refcount_t                 vm_refcnt __attribute__((__aligned__(64))); /*   128     4 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	struct {
> 		struct rb_node     rb __attribute__((__aligned__(8))); /*   136    24 */
> 		long unsigned int  rb_subtree_last;      /*   160     8 */
> 	} __attribute__((__aligned__(8))) shared __attribute__((__aligned__(8)));        /*   136    32 */
> 	struct anon_vma_name *     anon_name;            /*   168     8 */
> 	struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
> 	struct pfnmap_track_ctx *  pfnmap_track_ctx;     /*   176     8 */
> 
> 	/* size: 192, cachelines: 3, members: 19 */
> 	/* sum members: 176, holes: 2, sum holes: 8 */
> 	/* padding: 8 */
> 	/* forced alignments: 2, forced holes: 1, sum forced holes: 4 */
> } __attribute__((__aligned__(64)));
> 
> Observe that we allocate 192 bytes with or without pfnmap_track_ctx. (IIRC,
> slab sizes are ... 128, 192, 256, 512, ...)

True. I just double-checked: vm_area_cachep has SLAB_HWCACHE_ALIGN set, which
I think means it indeed works like that on x86_64 at least.  So it looks
like the new field at least isn't an immediate concern.
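
If we ever wanted to make that assumption explicit, a build-time assert
would be a cheap guard -- a sketch only (hypothetical, not something this
series needs):

	#include <linux/build_bug.h>
	#include <linux/mm_types.h>

	/* Fail the build if vm_area_struct outgrows the 192-byte
	 * kmalloc size class we are relying on above. */
	static_assert(sizeof(struct vm_area_struct) <= 192);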

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 11/11] mm/io-mapping: track_pfn() -> "pfnmap tracking"
  2025-04-28 16:06   ` Lorenzo Stoakes
@ 2025-04-28 16:14     ` David Hildenbrand
  0 siblings, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2025-04-28 16:14 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu

On 28.04.25 18:06, Lorenzo Stoakes wrote:
> On Fri, Apr 25, 2025 at 10:17:15AM +0200, David Hildenbrand wrote:
>> track_pfn() does not exist, let's simply refer to it as "pfnmap
>> tracking".
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> LGTM, so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> 
>> ---
>>   mm/io-mapping.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/io-mapping.c b/mm/io-mapping.c
>> index 01b3627999304..7266441ad0834 100644
>> --- a/mm/io-mapping.c
>> +++ b/mm/io-mapping.c
>> @@ -21,7 +21,7 @@ int io_mapping_map_user(struct io_mapping *iomap, struct vm_area_struct *vma,
>>   	if (WARN_ON_ONCE((vma->vm_flags & expected_flags) != expected_flags))
>>   		return -EINVAL;
>>
>> -	/* We rely on prevalidation of the io-mapping to skip track_pfn(). */
>> +	/* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
>>   	return remap_pfn_range_notrack(vma, addr, pfn, size,
>>   		__pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) |
>>   			 (pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK)));
>> --
>> 2.49.0
>>
> 
> However this doesn't apply after commit b8d8f1830bab ("mm/io-mapping:
> precompute remap protection flags for clarity"), so it will need a rebase :)
> It seems this was cleaned up to separate the __pgprot() bit from the
> remap_pfn_range_notrack().

Yeah, I reviewed that just today. Trivial conflict :)

Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-28 16:08       ` Peter Xu
@ 2025-04-28 16:16         ` David Hildenbrand
  2025-04-28 16:24           ` Peter Xu
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-28 16:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On 28.04.25 18:08, Peter Xu wrote:
> On Fri, Apr 25, 2025 at 10:36:55PM +0200, David Hildenbrand wrote:
>> On 25.04.25 22:23, Peter Xu wrote:
>>> On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
>>>> Let's use our new interface. In remap_pfn_range(), we'll now decide
>>>> whether we have to track (full VMA covered) or only sanitize the pgprot
>>>> (partial VMA covered).
>>>>
>>>> Remember what we have to untrack by linking it from the VMA. When
>>>> duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
>>>> to anon VMA names, and use a kref to share the tracking.
>>>>
>>>> Once the last VMA un-refs our tracking data, we'll do the untracking,
>>>> which simplifies things a lot and should sort out various issues we saw
>>>> recently, for example, when partially unmapping/zapping a tracked VMA.
>>>>
>>>> This change implies that we'll keep tracking the original PFN range even
>>>> after splitting + partially unmapping it: not too bad, because it was
>>>> not working reliably before. The only thing that kind-of worked before
>>>> was shrinking such a mapping using mremap(): we managed to adjust the
>>>> reservation in a hacky way, now we won't adjust the reservation but
>>>> leave it around until all involved VMAs are gone.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    include/linux/mm_inline.h |  2 +
>>>>    include/linux/mm_types.h  | 11 ++++++
>>>>    kernel/fork.c             | 54 ++++++++++++++++++++++++--
>>>>    mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
>>>>    mm/mremap.c               |  4 --
>>>>    5 files changed, 128 insertions(+), 24 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>>>> index f9157a0c42a5c..89b518ff097e6 100644
>>>> --- a/include/linux/mm_inline.h
>>>> +++ b/include/linux/mm_inline.h
>>>> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>>>>    #endif  /* CONFIG_ANON_VMA_NAME */
>>>> +void pfnmap_track_ctx_release(struct kref *ref);
>>>> +
>>>>    static inline void init_tlb_flush_pending(struct mm_struct *mm)
>>>>    {
>>>>    	atomic_set(&mm->tlb_flush_pending, 0);
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 56d07edd01f91..91124761cfda8 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -764,6 +764,14 @@ struct vma_numab_state {
>>>>    	int prev_scan_seq;
>>>>    };
>>>> +#ifdef __HAVE_PFNMAP_TRACKING
>>>> +struct pfnmap_track_ctx {
>>>> +	struct kref kref;
>>>> +	unsigned long pfn;
>>>> +	unsigned long size;
>>>> +};
>>>> +#endif
>>>> +
>>>>    /*
>>>>     * This struct describes a virtual memory area. There is one of these
>>>>     * per VM-area/task. A VM area is any part of the process virtual memory
>>>> @@ -877,6 +885,9 @@ struct vm_area_struct {
>>>>    	struct anon_vma_name *anon_name;
>>>>    #endif
>>>>    	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>>>> +#ifdef __HAVE_PFNMAP_TRACKING
>>>> +	struct pfnmap_track_ctx *pfnmap_track_ctx;
>>>> +#endif
>>>
>>> So this was originally the small concern (or is it small?) that this will
>>> grow every vma on x86, am I right?
>>
>> Yeah, and last time I looked into this, it would have grown it such that it would
>> require a bigger slab. Right now:
> 
> Probably due to what config you have.  E.g., when I'm looking at mine it's
> much bigger and already consuming 256B, but it's because I enabled more
> things (userfaultfd, lockdep, etc.).

Note that I enabled everything that you would expect on a production
system (incl. userfaultfd, mempolicy, per-vma locks), so I didn't
enable lockdep.

Thanks for verifying!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode()
  2025-04-25  8:17 ` [PATCH v1 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode() David Hildenbrand
@ 2025-04-28 16:16   ` Lorenzo Stoakes
  2025-04-28 16:19     ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 16:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu

On Fri, Apr 25, 2025 at 10:17:05AM +0200, David Hildenbrand wrote:
> Let's factor it out to make the code easier to grasp.
>
> Use it also in pgprot_writecombine()/pgprot_writethrough() where
> clearing the old cachemode might not be required, but given that we are
> already doing a function call, no need to care about this
> micro-optimization.

Ah my kind of patch :)

>
> Signed-off-by: David Hildenbrand <david@redhat.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  arch/x86/mm/pat/memtype.c | 33 +++++++++++++++------------------
>  1 file changed, 15 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index 72d8cbc611583..edec5859651d6 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -800,6 +800,12 @@ static inline int range_is_allowed(unsigned long pfn, unsigned long size)
>  }
>  #endif /* CONFIG_STRICT_DEVMEM */
>
> +static inline void pgprot_set_cachemode(pgprot_t *prot, enum page_cache_mode pcm)
> +{
> +	*prot = __pgprot((pgprot_val(*prot) & ~_PAGE_CACHE_MASK) |
> +			 cachemode2protval(pcm));
> +}
> +
>  int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
>  				unsigned long size, pgprot_t *vma_prot)
>  {
> @@ -811,8 +817,7 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
>  	if (file->f_flags & O_DSYNC)
>  		pcm = _PAGE_CACHE_MODE_UC_MINUS;
>
> -	*vma_prot = __pgprot((pgprot_val(*vma_prot) & ~_PAGE_CACHE_MASK) |
> -			     cachemode2protval(pcm));
> +	pgprot_set_cachemode(vma_prot, pcm);
>  	return 1;
>  }
>
> @@ -880,9 +885,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
>  				(unsigned long long)paddr,
>  				(unsigned long long)(paddr + size - 1),
>  				cattr_name(pcm));
> -			*vma_prot = __pgprot((pgprot_val(*vma_prot) &
> -					     (~_PAGE_CACHE_MASK)) |
> -					     cachemode2protval(pcm));
> +			pgprot_set_cachemode(vma_prot, pcm);
>  		}
>  		return 0;
>  	}
> @@ -907,9 +910,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
>  		 * We allow returning different type than the one requested in
>  		 * non strict case.
>  		 */
> -		*vma_prot = __pgprot((pgprot_val(*vma_prot) &
> -				      (~_PAGE_CACHE_MASK)) |
> -				     cachemode2protval(pcm));
> +		pgprot_set_cachemode(vma_prot, pcm);
>  	}
>
>  	if (memtype_kernel_map_sync(paddr, size, pcm) < 0) {
> @@ -1060,9 +1061,7 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>  			return -EINVAL;
>  	}
>
> -	*prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
> -			 cachemode2protval(pcm));
> -
> +	pgprot_set_cachemode(prot, pcm);
>  	return 0;
>  }
>
> @@ -1073,10 +1072,8 @@ void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
>  	if (!pat_enabled())
>  		return;
>
> -	/* Set prot based on lookup */

We're losing a comment here but who cares, it's obvious what's happening.

>  	pcm = lookup_memtype(pfn_t_to_phys(pfn));
> -	*prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
> -			 cachemode2protval(pcm));
> +	pgprot_set_cachemode(prot, pcm);
>  }
>
>  /*
> @@ -1115,15 +1112,15 @@ void untrack_pfn_clear(struct vm_area_struct *vma)
>
>  pgprot_t pgprot_writecombine(pgprot_t prot)
>  {
> -	return __pgprot(pgprot_val(prot) |
> -				cachemode2protval(_PAGE_CACHE_MODE_WC));
> +	pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WC);
> +	return prot;
>  }
>  EXPORT_SYMBOL_GPL(pgprot_writecombine);
>
>  pgprot_t pgprot_writethrough(pgprot_t prot)
>  {
> -	return __pgprot(pgprot_val(prot) |
> -				cachemode2protval(_PAGE_CACHE_MODE_WT));
> +	pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WT);
> +	return prot;
>  }
>  EXPORT_SYMBOL_GPL(pgprot_writethrough);
>
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode()
  2025-04-28 16:16   ` Lorenzo Stoakes
@ 2025-04-28 16:19     ` David Hildenbrand
  0 siblings, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2025-04-28 16:19 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu

>>  	return 0;
>>   }
>>
>> @@ -1073,10 +1072,8 @@ void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
>>   	if (!pat_enabled())
>>   		return;
>>
>> -	/* Set prot based on lookup */
> 
> We're losing a comment here but who cares, it's obvious what's happening.
> 

Yeah, it's now self-documented :)

lookup ... set cachemode


Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
  2025-04-28 14:58         ` David Hildenbrand
@ 2025-04-28 16:21           ` Peter Xu
  2025-04-28 20:37             ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Xu @ 2025-04-28 16:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On Mon, Apr 28, 2025 at 04:58:46PM +0200, David Hildenbrand wrote:
> 
> > > What it does on PAT (only implementation so far ...) is looking up the
> > > memory type to select the caching mode that can be use.
> > > 
> > > "sanitize" was IMHO a good fit, because we must make sure that we don't use
> > > the wrong caching mode.
> > > 
> > > update/setup/... don't make that quite clear. Any other suggestions?
> > 
> > I'm very poor on naming.. :( So far anything seems slightly better than
> > sanitize to me, as the word "sanitize" is actually also used in memtype.c
> > for other purpose.. see sanitize_phys().
> 
> Sure, one can sanitize a lot of things. Here it's the cachemode/pgrpot, in
> the other functions it's an address.
> 
> Likely we should just call it pfnmap_X_cachemode()/
> 
> Set/update don't really fit for X in case pfnmap_X_cachemode() is a NOP.
> 
> pfnmap_setup_cachemode() ? Hm.

Sounds good here.
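
I.e., just the rename, keeping the semantics (a sketch; not what the
series currently has):

	/* Was pfnmap_sanitize_pgprot(): look up and set the cachemode. */
	int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size,
			pgprot_t *prot);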

> 
> > 
> > > 
> > > > 
> > > > > + * @pfn: the start of the pfn range
> > > > > + * @size: the size of the pfn range
> > > > > + * @prot: the pgprot to sanitize
> > > > > + *
> > > > > + * Sanitize the given pgprot for a pfn range, for example, adjusting the
> > > > > + * cachemode.
> > > > > + *
> > > > > + * This function cannot fail for a single page, but can fail for multiple
> > > > > + * pages.
> > > > > + *
> > > > > + * Returns 0 on success and -EINVAL on error.
> > > > > + */
> > > > > +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
> > > > > +		pgprot_t *prot);
> > > > >    extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> > > > >    		struct vm_area_struct *src_vma, unsigned long *pfn);
> > > > >    extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index fdcf0a6049b9f..b8ae5e1493315 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> > > > >    			return VM_FAULT_OOM;
> > > > >    	}
> > > > > -	track_pfn_insert(vma, &pgprot, pfn);
> > > > > +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
> > > > > +		return VM_FAULT_FALLBACK;
> > > > 
> > > > Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
> > > > trigger, though.
> > > > 
> > > > Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
> > > > replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
> > > > a win already in all cases.
> > > 
> > > It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. That's
> > > certainly helpful for the single-page case.
> > > 
> > > Regarding never failing here: we should check the whole range. We have to
> > > make sure that none of the pages has a memory type / caching mode that is
> > > incompatible with what we setup.
> > 
> > Would it happen in the real world?
> >
> > IIUC per-vma registration needs to happen first, which checks for memtype
> > conflicts in the first place, or reserve_pfn_range() could already have
> > failed.
> >
> > Here it's the fault path looking up the memtype, so I would expect it is
> > guaranteed all pfns under the same vma are following the verified (and same)
> > memtype?
> 
> The whole point of track_pfn_insert() is that it is used when we *don't* use
> reserve_pfn_range()->track_pfn_remap(), no?
> 
> track_pfn_remap() would check the whole range that gets mapped, so
> track_pfn_insert() user must similarly check the whole range that gets
> mapped.
> 
> Note that even track_pfn_insert() is already pretty clear on the intended
> usage: "called when a _new_ single pfn is established"

We need to define "new" then..  But I agree it's not crystal clear at
least.  I think I just wasn't the first to assume it was reserved, see this
(especially, the "Expectation" part..):

commit 5180da410db6369d1f95c9014da1c9bc33fb043e
Author: Suresh Siddha <suresh.b.siddha@intel.com>
Date:   Mon Oct 8 16:28:29 2012 -0700

    x86, pat: separate the pfn attribute tracking for remap_pfn_range and vm_insert_pfn
    
    With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
    attribute and uses it.  Expectation is that the driver reserves the
    memory attributes for the pfn before calling vm_insert_pfn().
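
For the single-pfn case above, the wrapper could then be as simple as (a
sketch; name and signature are assumptions, not the final interface):

	/* A single page cannot fail the lookup, so nothing to return. */
	static inline void pfnmap_sanitize_pgprot_pfn(pgprot_t *prot,
			unsigned long pfn)
	{
		pfnmap_sanitize_pgprot(pfn, PAGE_SIZE, prot);
	}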

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-28 16:16         ` David Hildenbrand
@ 2025-04-28 16:24           ` Peter Xu
  2025-04-28 17:23             ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Xu @ 2025-04-28 16:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On Mon, Apr 28, 2025 at 06:16:21PM +0200, David Hildenbrand wrote:
> > Probably due to what config you have.  E.g., when I'm looking mine it's
> > much bigger and already consuming 256B, but it's because I enabled more
> > things (userfaultfd, lockdep, etc.).
> 
> Note that I enabled everything that you would expect on a production system
> (incld. userfaultfd, mempolicy, per-vma locks), so I didn't enable lockep.

I still doubt whether you at least enabled userfaultfd, e.g., your previous
paste has:

  struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */

Not something that matters.. but just in case you didn't use the expected
config file you wanted to use..

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 03/11] x86/mm/pat: introduce pfnmap_track() and pfnmap_untrack()
  2025-04-25  8:17 ` [PATCH v1 03/11] x86/mm/pat: introduce pfnmap_track() and pfnmap_untrack() David Hildenbrand
@ 2025-04-28 16:53   ` Lorenzo Stoakes
  2025-04-28 17:12     ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 16:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu

On Fri, Apr 25, 2025 at 10:17:07AM +0200, David Hildenbrand wrote:
> Let's provide variants of track_pfn_remap() and untrack_pfn() that won't
> mess with VMAs, to replace the existing interface step-by-step.
>
> Add some documentation.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>

There's some pedantry below, but this looks fine generally, so
notwithstanding that,

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  arch/x86/mm/pat/memtype.c | 14 ++++++++++++++
>  include/linux/pgtable.h   | 33 +++++++++++++++++++++++++++++++++
>  2 files changed, 47 insertions(+)
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index 193e33251b18f..c011d8dd8f441 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -1068,6 +1068,20 @@ int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size, pgprot_t *prot
>  	return 0;
>  }
>
> +int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot)
> +{
> +	const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> +
> +	return reserve_pfn_range(paddr, size, prot, 0);

Nitty, but a pattern established by Liam which we've followed consistently
in VMA code is to prefix parameters that might be less than obvious,
especially boolean parameters, with a comment naming the parameter, e.g.:

	return reserve_pfn_range(paddr, size, prot, /*strict_prot=*/0);

> +}
> +
> +void pfnmap_untrack(unsigned long pfn, unsigned long size)
> +{
> +	const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> +
> +	free_pfn_range(paddr, size);
> +}
> +
>  /*
>   * untrack_pfn is called while unmapping a pfnmap for a region.
>   * untrack can be called for a specific region indicated by pfn and size or
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 91aadfe2515a5..898a3ab195578 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1506,6 +1506,16 @@ static inline int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
>  	return 0;
>  }
>
> +static inline int pfnmap_track(unsigned long pfn, unsigned long size,
> +		pgprot_t *prot)
> +{
> +	return 0;
> +}
> +
> +static inline void pfnmap_untrack(unsigned long pfn, unsigned long size)
> +{
> +}
> +
>  /*
>   * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
>   * tables copied during copy_page_range(). Will store the pfn to be
> @@ -1570,6 +1580,29 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>   */
>  int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
>  		pgprot_t *prot);
> +
> +/**
> + * pfnmap_track - track a pfn range

At the risk of sounding annoyingly pedantic and giving the kind of review
that is annoying: this really needs to be expanded; I think perhaps this
description is stating the obvious :)

To me the confusing thing is that the 'generic' sounding pfnmap_track() is
actually PAT-specific, so surely the description should give a brief
overview of PAT here, saying it's applicable on x86-64 etc. etc.

I'm not sure there's much use in keeping this generic when it clearly is
not at this point?

> + * @pfn: the start of the pfn range
> + * @size: the size of the pfn range

In what units? Given it's a pfn range it's a bit ambiguous as to whether it
should be expressed in pages/bytes.

> + * @prot: the pgprot to track
> + *
> + * Tracking a pfnmap range involves conditionally reserving a pfn range and
> + * sanitizing the pgprot -- see pfnmap_sanitize_pgprot().
> + *
> + * Returns 0 on success and -EINVAL on error.
> + */
> +int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot);
> +
> +/**
> + * pfnmap_untrack - untrack a pfn range
> + * @pfn: the start of the pfn range
> + * @size: the size of the pfn range

Same comment as above re: units.

> + *
> + * Untrack a pfn range previously tracked through pfnmap_track(), for example,
> + * undoing any reservation.
> + */
> +void pfnmap_untrack(unsigned long pfn, unsigned long size);
>  extern int track_pfn_copy(struct vm_area_struct *dst_vma,
>  		struct vm_area_struct *src_vma, unsigned long *pfn);
>  extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: convert to pfnmap_track() + pfnmap_untrack()
  2025-04-25 20:00   ` Peter Xu
  2025-04-25 20:14     ` David Hildenbrand
@ 2025-04-28 16:54     ` Lorenzo Stoakes
  2025-04-28 17:07     ` Lorenzo Stoakes
  2 siblings, 0 replies; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 16:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, linux-kernel, linux-mm, x86, intel-gfx,
	dri-devel, linux-trace-kernel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Tvrtko Ursulin, David Airlie, Simona Vetter, Andrew Morton,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato

On Fri, Apr 25, 2025 at 04:00:42PM -0400, Peter Xu wrote:
> On Fri, Apr 25, 2025 at 10:17:08AM +0200, David Hildenbrand wrote:
> > Let's use the new, cleaner interface.
> >
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
> >  mm/memremap.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index 2aebc1b192da9..c417c843e9b1f 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -130,7 +130,7 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
> >  	}
> >  	mem_hotplug_done();
> >
> > -	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
> > +	pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
> >  	pgmap_array_delete(range);
> >  }
> >
> > @@ -211,8 +211,8 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >  	if (nid < 0)
> >  		nid = numa_mem_id();
> >
> > -	error = track_pfn_remap(NULL, &params->pgprot, PHYS_PFN(range->start), 0,
> > -			range_len(range));
> > +	error = pfnmap_track(PHYS_PFN(range->start), range_len(range),
> > +			     &params->pgprot);
> >  	if (error)
> >  		goto err_pfn_remap;
> >
> > @@ -277,7 +277,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >  	if (!is_private)
> >  		kasan_remove_zero_shadow(__va(range->start), range_len(range));
> >  err_kasan:
> > -	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
> > +	pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
>
> Not a huge deal, but maybe we could merge this and previous patch?  It
> might be easier to reference the impl when reading the call site changes.

Agreed, it would also add a sort of built-in justification for the prior
patch, as in 'here is where we use it, this is why I'm doing this'.

And I think this is small enough, combined with the previous patch, that
having both together is not a big deal.

Thanks!

>
> >  err_pfn_remap:
> >  	pgmap_array_delete(range);
> >  	return error;
> > --
> > 2.49.0
> >
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: convert to pfnmap_track() + pfnmap_untrack()
  2025-04-25 20:00   ` Peter Xu
  2025-04-25 20:14     ` David Hildenbrand
  2025-04-28 16:54     ` Lorenzo Stoakes
@ 2025-04-28 17:07     ` Lorenzo Stoakes
  2 siblings, 0 replies; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 17:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, linux-kernel, linux-mm, x86, intel-gfx,
	dri-devel, linux-trace-kernel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Tvrtko Ursulin, David Airlie, Simona Vetter, Andrew Morton,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato

On Fri, Apr 25, 2025 at 04:00:42PM -0400, Peter Xu wrote:
> On Fri, Apr 25, 2025 at 10:17:08AM +0200, David Hildenbrand wrote:
> > Let's use the new, cleaner interface.
> >
> > Signed-off-by: David Hildenbrand <david@redhat.com>

LGTM, although, as discussed elsewhere in the thread, this should be
merged with the prior patch. FWIW:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> > ---
> >  mm/memremap.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index 2aebc1b192da9..c417c843e9b1f 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -130,7 +130,7 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
> >  	}
> >  	mem_hotplug_done();
> >
> > -	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
> > +	pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
> >  	pgmap_array_delete(range);
> >  }
> >
> > @@ -211,8 +211,8 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >  	if (nid < 0)
> >  		nid = numa_mem_id();
> >
> > -	error = track_pfn_remap(NULL, &params->pgprot, PHYS_PFN(range->start), 0,
> > -			range_len(range));
> > +	error = pfnmap_track(PHYS_PFN(range->start), range_len(range),
> > +			     &params->pgprot);
> >  	if (error)
> >  		goto err_pfn_remap;
> >
> > @@ -277,7 +277,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >  	if (!is_private)
> >  		kasan_remove_zero_shadow(__va(range->start), range_len(range));
> >  err_kasan:
> > -	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
> > +	pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
>
> Not a huge deal, but maybe we could merge this and previous patch?  It
> might be easier to reference the impl when reading the call site changes.
>
> >  err_pfn_remap:
> >  	pgmap_array_delete(range);
> >  	return error;
> > --
> > 2.49.0
> >
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 03/11] x86/mm/pat: introduce pfnmap_track() and pfnmap_untrack()
  2025-04-28 16:53   ` Lorenzo Stoakes
@ 2025-04-28 17:12     ` David Hildenbrand
  2025-04-28 18:58       ` Lorenzo Stoakes
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-28 17:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu


>>
>> +int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot)
>> +{
>> +	const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
>> +
>> +	return reserve_pfn_range(paddr, size, prot, 0);
> 
> Nitty, but a pattern established by Liam which we've followed consistently
> in VMA code is to prefix parameters that might be less than obvious,
> especially boolean parameters, with a comment naming the parameter, e.g.:
 > > 	return reserve_pfn_range(paddr, size, prot, /*strict_prot=*/0);

Not sure I like that. But as this parameter goes away patch #8, I'll 
leave it as is in this patch and not start a bigger discussion on better 
alternatives (don't use these stupid boolean variables ...) ;)
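
To illustrate the kind of alternative I mean (hypothetical, not part of
this series): make the mode explicit at every call site instead of a
bare 0/1:

	enum pfnmap_reserve_mode {
		PFNMAP_RESERVE_RELAXED,
		PFNMAP_RESERVE_STRICT,
	};

	int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *prot,
			      enum pfnmap_reserve_mode mode);

	/* Call sites then document themselves:
	 * reserve_pfn_range(paddr, size, prot, PFNMAP_RESERVE_RELAXED);
	 */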

[...]

>> +
>> +/**
>> + * pfnmap_track - track a pfn range
> 
> At the risk of sounding annoyingly pedantic and giving the kind of review
> that is annoying: this really needs to be expanded; I think perhaps this
> description is stating the obvious :)
> 
> To me the confusing thing is that the 'generic' sounding pfnmap_track() is
> actually PAT-specific, so surely the description should give a brief
> overview of PAT here, saying it's applicable on x86-64 etc. etc.
> 
> I'm not sure there's much use in keeping this generic when it clearly is
> not at this point?

Sorry, is your suggestion to document more PAT stuff or what exactly?

As you know, I'm a busy man ... so instructions/suggestions please :)

> 
>> + * @pfn: the start of the pfn range
>> + * @size: the size of the pfn range
> 
> In what units? Given it's a pfn range it's a bit ambiguous as to whether it
> should be expressed in pages/bytes.

Agreed. It's bytes. (not my favorite here, but good enough)
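
So, spelled out (illustrative only -- a hypothetical caller, not code from
the series):

	/* @size is in bytes, so this tracks four pages. */
	static int track_four_pages(unsigned long pfn, pgprot_t *prot)
	{
		int err;

		err = pfnmap_track(pfn, 4 * PAGE_SIZE, prot);
		if (err)
			return err;
		/* ... map and use the range ... */
		pfnmap_untrack(pfn, 4 * PAGE_SIZE);
		return 0;
	}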


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-28 16:24           ` Peter Xu
@ 2025-04-28 17:23             ` David Hildenbrand
  2025-04-28 19:37               ` Lorenzo Stoakes
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-28 17:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On 28.04.25 18:24, Peter Xu wrote:
> On Mon, Apr 28, 2025 at 06:16:21PM +0200, David Hildenbrand wrote:
>>> Probably due to what config you have.  E.g., when I'm looking at mine it's
>>> much bigger and already consuming 256B, but it's because I enabled more
>>> things (userfaultfd, lockdep, etc.).
>>
>> Note that I enabled everything that you would expect on a production system
>> (incl. userfaultfd, mempolicy, per-vma locks), so I didn't enable lockdep.
> 
> I still doubt whether you at least enabled userfaultfd, e.g., your previous
> paste has:
> 
>    struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
> 
> Not something that matters.. but just in case you didn't use the expected
> config file you wanted to use..

You're absolutely right. I only briefly rechecked for this purpose here 
on my notebook, and only looked for the existence of members, not 
expecting that we have confusing stuff like vm_userfaultfd_ctx.

I checked again and the size stays at 192 with allyesconfig and then 
disabling debug options.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 03/11] x86/mm/pat: introduce pfnmap_track() and pfnmap_untrack()
  2025-04-28 17:12     ` David Hildenbrand
@ 2025-04-28 18:58       ` Lorenzo Stoakes
  0 siblings, 0 replies; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 18:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu

On Mon, Apr 28, 2025 at 07:12:11PM +0200, David Hildenbrand wrote:
>
> > >
> > > +int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot)
> > > +{
> > > +	const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> > > +
> > > +	return reserve_pfn_range(paddr, size, prot, 0);
> >
> > Nitty, but a pattern established by Liam which we've followed consistently
> > in VMA code is to prefix parameters that might be less than obvious,
> > especially boolean parameters, with a comment naming the parameter, e.g.:
> > > 	return reserve_pfn_range(paddr, size, prot, /*strict_prot=*/0);
>
> Not sure I like that. But as this parameter goes away in patch #8, I'll leave
> it as-is in this patch and not start a bigger discussion on better
> alternatives (don't use these stupid boolean variables ...) ;)
>
> [...]
>
> > > +
> > > +/**
> > > + * pfnmap_track - track a pfn range
> >
> > At the risk of sounding annoyingly pedantic and giving the kind of review
> > that is annoying: this really needs to be expanded; I think perhaps this
> > description is stating the obvious :)
> >
> > To me the confusing thing is that the 'generic' sounding pfnmap_track() is
> > actually PAT-specific, so surely the description should give a brief
> > overview of PAT here, saying it's applicable on x86-64 etc. etc.
> >
> > I'm not sure there's much use in keeping this generic when it clearly is
> > not at this point?
>
> Sorry, is your suggestion to document more PAT stuff or what exactly?
>
> As you know, I'm a busy man ... so instructions/suggestions please :)

Haha sure, I _think_ the model here is to have a brief summary then underneath a
more detailed explanation, so that could be:

	This address range is requested to be 'tracked' by a hardware
	implementation allowing fine-grained control of memory attributes at
	page level granularity.

	This allows for fine-grained control of memory cache behaviour. Tracking
	memory this way is persisted across VMA split and merge.

	Currently there is only one implementation for this - x86 Page Attribute
	Table (PAT). See Documentation/arch/x86/pat.rst for more details.
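
Folded into the kernel-doc, that might look like (a sketch, combining the
above with the byte-units point from the other subthread):

	/**
	 * pfnmap_track - track a pfn range
	 * @pfn: the start of the pfn range
	 * @size: the size of the pfn range in bytes
	 * @prot: the pgprot to track
	 *
	 * This address range is requested to be 'tracked' by a hardware
	 * implementation allowing fine-grained control of memory attributes
	 * at page level granularity, e.g., of memory cache behaviour.
	 * Tracking persists across VMA split and merge.
	 *
	 * Currently there is only one implementation - x86 Page Attribute
	 * Table (PAT). See Documentation/arch/x86/pat.rst for more details.
	 *
	 * Returns 0 on success and -EINVAL on error.
	 */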

>
> >
> > > + * @pfn: the start of the pfn range
> > > + * @size: the size of the pfn range
> >
> > In what units? Given it's a pfn range it's a bit ambiguous as to whether it
> > should be expressed in pages/bytes.
>
> Agreed. It's bytes. (not my favorite here, but good enough)

Ack, definitely need to spell it out here! Cheers :)

>
>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-28 17:23             ` David Hildenbrand
@ 2025-04-28 19:37               ` Lorenzo Stoakes
  2025-04-28 19:57                 ` Suren Baghdasaryan
  2025-04-28 20:19                 ` David Hildenbrand
  0 siblings, 2 replies; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 19:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato

On Mon, Apr 28, 2025 at 07:23:18PM +0200, David Hildenbrand wrote:
> On 28.04.25 18:24, Peter Xu wrote:
> > On Mon, Apr 28, 2025 at 06:16:21PM +0200, David Hildenbrand wrote:
> > > > Probably due to what config you have.  E.g., when I'm looking at mine it's
> > > > much bigger and already consuming 256B, but it's because I enabled more
> > > > things (userfaultfd, lockdep, etc.).
> > >
> > > Note that I enabled everything that you would expect on a production system
> > > (incl. userfaultfd, mempolicy, per-vma locks), so I didn't enable lockdep.
> >
> > I still doubt whether you at least enabled userfaultfd, e.g., your previous
> > paste has:
> >
> >    struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
> >
> > Not something that matters.. but just in case you didn't use the expected
> > config file you wanted to use..
>
> You're absolutely right. I only briefly rechecked for this purpose here on
> my notebook, and only looked for the existence of members, not expecting
> that we have confusing stuff like vm_userfaultfd_ctx.
>
> I checked again and the size stays at 192 with allyesconfig and then
> disabling debug options.

I think a reasonable case is everything on except CONFIG_DEBUG_LOCK_ALLOC, and I
don't care about nommu.

So:

CONFIG_PER_VMA_LOCK
CONFIG_SWAP
CONFIG_MMU (exclude the nommu vm_region field)
CONFIG_NUMA
CONFIG_NUMA_BALANCING
CONFIG_ANON_VMA_NAME
__HAVE_PFNMAP_TRACKING

So to be clear - allyesconfig w/o debug gives us this yes? And we don't add a
cache line? In which case all good :)


>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-25  8:17 ` [PATCH v1 05/11] mm: convert VM_PFNMAP tracking " David Hildenbrand
  2025-04-25 20:23   ` Peter Xu
@ 2025-04-28 19:38   ` Lorenzo Stoakes
  2025-04-28 20:00     ` Suren Baghdasaryan
  2025-04-28 20:10   ` Lorenzo Stoakes
  2 siblings, 1 reply; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 19:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu,
	Suren Baghdasaryan

+cc Suren, who has worked HEAVILY on VMA field manipulation and such :)

Suren - David is proposing adding a new field. AFAICT this does not add a
new cache line so I think we're all good.

But FYI!

On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
> Let's use our new interface. In remap_pfn_range(), we'll now decide
> whether we have to track (full VMA covered) or only sanitize the pgprot
> (partial VMA covered).
>
> Remember what we have to untrack by linking it from the VMA. When
> duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
> to anon VMA names, and use a kref to share the tracking.
>
> Once the last VMA un-refs our tracking data, we'll do the untracking,
> > which simplifies things a lot and should sort out various issues we saw
> recently, for example, when partially unmapping/zapping a tracked VMA.
>
> This change implies that we'll keep tracking the original PFN range even
> after splitting + partially unmapping it: not too bad, because it was
> not working reliably before. The only thing that kind-of worked before
> was shrinking such a mapping using mremap(): we managed to adjust the
> reservation in a hacky way, now we won't adjust the reservation but
> leave it around until all involved VMAs are gone.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/mm_inline.h |  2 +
>  include/linux/mm_types.h  | 11 ++++++
>  kernel/fork.c             | 54 ++++++++++++++++++++++++--
>  mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
>  mm/mremap.c               |  4 --
>  5 files changed, 128 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index f9157a0c42a5c..89b518ff097e6 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>
>  #endif  /* CONFIG_ANON_VMA_NAME */
>
> +void pfnmap_track_ctx_release(struct kref *ref);
> +
>  static inline void init_tlb_flush_pending(struct mm_struct *mm)
>  {
>  	atomic_set(&mm->tlb_flush_pending, 0);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 56d07edd01f91..91124761cfda8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -764,6 +764,14 @@ struct vma_numab_state {
>  	int prev_scan_seq;
>  };
>
> +#ifdef __HAVE_PFNMAP_TRACKING
> +struct pfnmap_track_ctx {
> +	struct kref kref;
> +	unsigned long pfn;
> +	unsigned long size;
> +};
> +#endif
> +
>  /*
>   * This struct describes a virtual memory area. There is one of these
>   * per VM-area/task. A VM area is any part of the process virtual memory
> @@ -877,6 +885,9 @@ struct vm_area_struct {
>  	struct anon_vma_name *anon_name;
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef __HAVE_PFNMAP_TRACKING
> +	struct pfnmap_track_ctx *pfnmap_track_ctx;
> +#endif
>  } __randomize_layout;
>
>  #ifdef CONFIG_NUMA
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 168681fc4b25a..ae518b8fe752c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -481,7 +481,51 @@ static void vm_area_init_from(const struct vm_area_struct *src,
>  #ifdef CONFIG_NUMA
>  	dest->vm_policy = src->vm_policy;
>  #endif
> +#ifdef __HAVE_PFNMAP_TRACKING
> +	dest->pfnmap_track_ctx = NULL;
> +#endif
> +}
> +
> +#ifdef __HAVE_PFNMAP_TRACKING
> +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> +		struct vm_area_struct *new)
> +{
> +	struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx;
> +
> +	if (likely(!ctx))
> +		return 0;
> +
> +	/*
> +	 * We don't expect to ever hit this. If ever required, we would have
> +	 * to duplicate the tracking.
> +	 */
> +	if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX))
> +		return -ENOMEM;
> +	kref_get(&ctx->kref);
> +	new->pfnmap_track_ctx = ctx;
> +	return 0;
> +}
> +
> +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
> +{
> +	struct pfnmap_track_ctx *ctx = vma->pfnmap_track_ctx;
> +
> +	if (likely(!ctx))
> +		return;
> +
> +	kref_put(&ctx->kref, pfnmap_track_ctx_release);
> +	vma->pfnmap_track_ctx = NULL;
> +}
> +#else
> +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> +		struct vm_area_struct *new)
> +{
> +	return 0;
>  }
> +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
> +{
> +}
> +#endif
>
>  struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  {
> @@ -493,6 +537,11 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
>  	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
>  	vm_area_init_from(orig, new);
> +
> +	if (vma_pfnmap_track_ctx_dup(orig, new)) {
> +		kmem_cache_free(vm_area_cachep, new);
> +		return NULL;
> +	}
>  	vma_lock_init(new, true);
>  	INIT_LIST_HEAD(&new->anon_vma_chain);
>  	vma_numab_state_init(new);
> @@ -507,6 +556,7 @@ void vm_area_free(struct vm_area_struct *vma)
>  	vma_assert_detached(vma);
>  	vma_numab_state_free(vma);
>  	free_anon_vma_name(vma);
> +	vma_pfnmap_track_ctx_release(vma);
>  	kmem_cache_free(vm_area_cachep, vma);
>  }
>
> @@ -669,10 +719,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		if (!tmp)
>  			goto fail_nomem;
>
> -		/* track_pfn_copy() will later take care of copying internal state. */
> -		if (unlikely(tmp->vm_flags & VM_PFNMAP))
> -			untrack_pfn_clear(tmp);
> -
>  		retval = vma_dup_policy(mpnt, tmp);
>  		if (retval)
>  			goto fail_nomem_policy;
> diff --git a/mm/memory.c b/mm/memory.c
> index c737a8625866a..eb2b3f10a97ec 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1370,7 +1370,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  	struct mm_struct *dst_mm = dst_vma->vm_mm;
>  	struct mm_struct *src_mm = src_vma->vm_mm;
>  	struct mmu_notifier_range range;
> -	unsigned long next, pfn = 0;
> +	unsigned long next;
>  	bool is_cow;
>  	int ret;
>
> @@ -1380,12 +1380,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  	if (is_vm_hugetlb_page(src_vma))
>  		return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
>
> -	if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
> -		ret = track_pfn_copy(dst_vma, src_vma, &pfn);
> -		if (ret)
> -			return ret;
> -	}
> -
>  	/*
>  	 * We need to invalidate the secondary MMU mappings only when
>  	 * there could be a permission downgrade on the ptes of the
> @@ -1427,8 +1421,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  		raw_write_seqcount_end(&src_mm->write_protect_seq);
>  		mmu_notifier_invalidate_range_end(&range);
>  	}
> -	if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
> -		untrack_pfn_copy(dst_vma, pfn);
>  	return ret;
>  }
>
> @@ -1923,9 +1915,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>  	if (vma->vm_file)
>  		uprobe_munmap(vma, start, end);
>
> -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> -		untrack_pfn(vma, 0, 0, mm_wr_locked);
> -
>  	if (start != end) {
>  		if (unlikely(is_vm_hugetlb_page(vma))) {
>  			/*
> @@ -2871,6 +2860,36 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
>  	return error;
>  }
>
> +#ifdef __HAVE_PFNMAP_TRACKING
> +static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
> +		unsigned long size, pgprot_t *prot)
> +{
> +	struct pfnmap_track_ctx *ctx;
> +
> +	if (pfnmap_track(pfn, size, prot))
> +		return ERR_PTR(-EINVAL);
> +
> +	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (unlikely(!ctx)) {
> +		pfnmap_untrack(pfn, size);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	ctx->pfn = pfn;
> +	ctx->size = size;
> +	kref_init(&ctx->kref);
> +	return ctx;
> +}
> +
> +void pfnmap_track_ctx_release(struct kref *ref)
> +{
> +	struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref);
> +
> +	pfnmap_untrack(ctx->pfn, ctx->size);
> +	kfree(ctx);
> +}
> +#endif /* __HAVE_PFNMAP_TRACKING */
> +
>  /**
>   * remap_pfn_range - remap kernel memory to userspace
>   * @vma: user vma to map to
> @@ -2883,20 +2902,50 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
>   *
>   * Return: %0 on success, negative error code otherwise.
>   */
> +#ifdef __HAVE_PFNMAP_TRACKING
>  int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>  		    unsigned long pfn, unsigned long size, pgprot_t prot)
>  {
> +	struct pfnmap_track_ctx *ctx = NULL;
>  	int err;
>
> -	err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
> -	if (err)
> +	size = PAGE_ALIGN(size);
> +
> +	/*
> +	 * If we cover the full VMA, we'll perform actual tracking, and
> +	 * remember to untrack when the last reference to our tracking
> +	 * context from a VMA goes away.
> +	 *
> +	 * If we only cover parts of the VMA, we'll only sanitize the
> +	 * pgprot.
> +	 */
> +	if (addr == vma->vm_start && addr + size == vma->vm_end) {
> +		if (vma->pfnmap_track_ctx)
> +			return -EINVAL;
> +		ctx = pfnmap_track_ctx_alloc(pfn, size, &prot);
> +		if (IS_ERR(ctx))
> +			return PTR_ERR(ctx);
> +	} else if (pfnmap_sanitize_pgprot(pfn, size, &prot)) {
>  		return -EINVAL;
> +	}
>
>  	err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> -	if (err)
> -		untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
> +	if (ctx) {
> +		if (err)
> +			kref_put(&ctx->kref, pfnmap_track_ctx_release);
> +		else
> +			vma->pfnmap_track_ctx = ctx;
> +	}
>  	return err;
>  }
> +
> +#else
> +int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> +		    unsigned long pfn, unsigned long size, pgprot_t prot)
> +{
> +	return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> +}
> +#endif
>  EXPORT_SYMBOL(remap_pfn_range);
>
>  /**
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 7db9da609c84f..6e78e02f74bd3 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -1191,10 +1191,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
>  	if (is_vm_hugetlb_page(vma))
>  		clear_vma_resv_huge_pages(vma);
>
> -	/* Tell pfnmap has moved from this vma */
> -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> -		untrack_pfn_clear(vma);
> -
>  	*new_vma_ptr = new_vma;
>  	return err;
>  }
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-28 19:37               ` Lorenzo Stoakes
@ 2025-04-28 19:57                 ` Suren Baghdasaryan
  2025-04-28 20:23                   ` David Hildenbrand
  2025-04-28 20:19                 ` David Hildenbrand
  1 sibling, 1 reply; 59+ messages in thread
From: Suren Baghdasaryan @ 2025-04-28 19:57 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Peter Xu, linux-kernel, linux-mm, x86,
	intel-gfx, dri-devel, linux-trace-kernel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, David Airlie, Simona Vetter,
	Andrew Morton, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato

On Mon, Apr 28, 2025 at 12:37 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Apr 28, 2025 at 07:23:18PM +0200, David Hildenbrand wrote:
> > On 28.04.25 18:24, Peter Xu wrote:
> > > On Mon, Apr 28, 2025 at 06:16:21PM +0200, David Hildenbrand wrote:
> > > > > Probably due to what config you have.  E.g., when I'm looking at mine it's
> > > > > much bigger and already consuming 256B, but it's because I enabled more
> > > > > things (userfaultfd, lockdep, etc.).
> > > >
> > > > Note that I enabled everything that you would expect on a production system
> > > > (incl. userfaultfd, mempolicy, per-vma locks), so I didn't enable lockdep.
> > >
> > > I still doubt whether you at least enabled userfaultfd, e.g., your previous
> > > paste has:
> > >
> > >    struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
> > >
> > > Not something that matters.. but just in case you didn't use the expected
> > > config file you wanted to use..
> >
> > You're absolutely right. I only briefly rechecked for this purpose here on
> > my notebook, and only looked for the existence of members, not expecting
> > that we have confusing stuff like vm_userfaultfd_ctx.
> >
> > I checked again and the size stays at 192 with allyesconfig and then
> > disabling debug options.
>
> I think a reasonable case is everything on except CONFIG_DEBUG_LOCK_ALLOC, and I
> don't care about nommu.

I think it's safe to assume that production systems would disable
lockdep due to the performance overhead. At least that's what we do on
Android - enable it on development branches but disable in production.

>
> So:
>
> CONFIG_PER_VMA_LOCK
> CONFIG_SWAP
> CONFIG_MMU (exclude the nommu vm_region field)
> CONFIG_NUMA
> CONFIG_NUMA_BALANCING
> CONFIG_ANON_VMA_NAME
> __HAVE_PFNMAP_TRACKING
>
> So to be clear - allyesconfig w/o debug gives us this yes? And we don't add a
> cache line? In which case all good :)
>
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-28 19:38   ` Lorenzo Stoakes
@ 2025-04-28 20:00     ` Suren Baghdasaryan
  2025-04-28 20:21       ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Suren Baghdasaryan @ 2025-04-28 20:00 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, linux-kernel, linux-mm, x86, intel-gfx,
	dri-devel, linux-trace-kernel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Tvrtko Ursulin, David Airlie, Simona Vetter, Andrew Morton,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Liam R. Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Peter Xu

On Mon, Apr 28, 2025 at 12:47 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> +cc Suren, who has worked HEAVILY on VMA field manipulation and such :)
>
> Suren - David is proposing adding a new field. AFAICT this does not add a
> new cache line so I think we're all good.
>
> But FYI!

Thanks! Yes, there should be some space in the last cacheline after my
last field reshuffling.

>
> On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
> > Let's use our new interface. In remap_pfn_range(), we'll now decide
> > whether we have to track (full VMA covered) or only sanitize the pgprot
> > (partial VMA covered).
> >
> > Remember what we have to untrack by linking it from the VMA. When
> > duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
> > to anon VMA names, and use a kref to share the tracking.
> >
> > Once the last VMA un-refs our tracking data, we'll do the untracking,
> > which simplifies things a lot and should sort out the various issues we saw
> > recently, for example, when partially unmapping/zapping a tracked VMA.
> >
> > This change implies that we'll keep tracking the original PFN range even
> > after splitting + partially unmapping it: not too bad, because it was
> > not working reliably before. The only thing that kind-of worked before
> > was shrinking such a mapping using mremap(): we managed to adjust the
> > reservation in a hacky way, now we won't adjust the reservation but
> > leave it around until all involved VMAs are gone.
> >
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
> >  include/linux/mm_inline.h |  2 +
> >  include/linux/mm_types.h  | 11 ++++++
> >  kernel/fork.c             | 54 ++++++++++++++++++++++++--
> >  mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
> >  mm/mremap.c               |  4 --
> >  5 files changed, 128 insertions(+), 24 deletions(-)
> >
> > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > index f9157a0c42a5c..89b518ff097e6 100644
> > --- a/include/linux/mm_inline.h
> > +++ b/include/linux/mm_inline.h
> > @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
> >
> >  #endif  /* CONFIG_ANON_VMA_NAME */
> >
> > +void pfnmap_track_ctx_release(struct kref *ref);
> > +
> >  static inline void init_tlb_flush_pending(struct mm_struct *mm)
> >  {
> >       atomic_set(&mm->tlb_flush_pending, 0);
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 56d07edd01f91..91124761cfda8 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -764,6 +764,14 @@ struct vma_numab_state {
> >       int prev_scan_seq;
> >  };
> >
> > +#ifdef __HAVE_PFNMAP_TRACKING
> > +struct pfnmap_track_ctx {
> > +     struct kref kref;
> > +     unsigned long pfn;
> > +     unsigned long size;
> > +};
> > +#endif
> > +
> >  /*
> >   * This struct describes a virtual memory area. There is one of these
> >   * per VM-area/task. A VM area is any part of the process virtual memory
> > @@ -877,6 +885,9 @@ struct vm_area_struct {
> >       struct anon_vma_name *anon_name;
> >  #endif
> >       struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > +#ifdef __HAVE_PFNMAP_TRACKING
> > +     struct pfnmap_track_ctx *pfnmap_track_ctx;
> > +#endif
> >  } __randomize_layout;
> >
> >  #ifdef CONFIG_NUMA
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 168681fc4b25a..ae518b8fe752c 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -481,7 +481,51 @@ static void vm_area_init_from(const struct vm_area_struct *src,
> >  #ifdef CONFIG_NUMA
> >       dest->vm_policy = src->vm_policy;
> >  #endif
> > +#ifdef __HAVE_PFNMAP_TRACKING
> > +     dest->pfnmap_track_ctx = NULL;
> > +#endif
> > +}
> > +
> > +#ifdef __HAVE_PFNMAP_TRACKING
> > +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> > +             struct vm_area_struct *new)
> > +{
> > +     struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx;
> > +
> > +     if (likely(!ctx))
> > +             return 0;
> > +
> > +     /*
> > +      * We don't expect to ever hit this. If ever required, we would have
> > +      * to duplicate the tracking.
> > +      */
> > +     if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX))
> > +             return -ENOMEM;
> > +     kref_get(&ctx->kref);
> > +     new->pfnmap_track_ctx = ctx;
> > +     return 0;
> > +}
> > +
> > +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
> > +{
> > +     struct pfnmap_track_ctx *ctx = vma->pfnmap_track_ctx;
> > +
> > +     if (likely(!ctx))
> > +             return;
> > +
> > +     kref_put(&ctx->kref, pfnmap_track_ctx_release);
> > +     vma->pfnmap_track_ctx = NULL;
> > +}
> > +#else
> > +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> > +             struct vm_area_struct *new)
> > +{
> > +     return 0;
> >  }
> > +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
> > +{
> > +}
> > +#endif
> >
> >  struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >  {
> > @@ -493,6 +537,11 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> >       ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
> >       ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
> >       vm_area_init_from(orig, new);
> > +
> > +     if (vma_pfnmap_track_ctx_dup(orig, new)) {
> > +             kmem_cache_free(vm_area_cachep, new);
> > +             return NULL;
> > +     }
> >       vma_lock_init(new, true);
> >       INIT_LIST_HEAD(&new->anon_vma_chain);
> >       vma_numab_state_init(new);
> > @@ -507,6 +556,7 @@ void vm_area_free(struct vm_area_struct *vma)
> >       vma_assert_detached(vma);
> >       vma_numab_state_free(vma);
> >       free_anon_vma_name(vma);
> > +     vma_pfnmap_track_ctx_release(vma);
> >       kmem_cache_free(vm_area_cachep, vma);
> >  }
> >
> > @@ -669,10 +719,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >               if (!tmp)
> >                       goto fail_nomem;
> >
> > -             /* track_pfn_copy() will later take care of copying internal state. */
> > -             if (unlikely(tmp->vm_flags & VM_PFNMAP))
> > -                     untrack_pfn_clear(tmp);
> > -
> >               retval = vma_dup_policy(mpnt, tmp);
> >               if (retval)
> >                       goto fail_nomem_policy;
> > diff --git a/mm/memory.c b/mm/memory.c
> > index c737a8625866a..eb2b3f10a97ec 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1370,7 +1370,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >       struct mm_struct *dst_mm = dst_vma->vm_mm;
> >       struct mm_struct *src_mm = src_vma->vm_mm;
> >       struct mmu_notifier_range range;
> > -     unsigned long next, pfn = 0;
> > +     unsigned long next;
> >       bool is_cow;
> >       int ret;
> >
> > @@ -1380,12 +1380,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >       if (is_vm_hugetlb_page(src_vma))
> >               return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
> >
> > -     if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
> > -             ret = track_pfn_copy(dst_vma, src_vma, &pfn);
> > -             if (ret)
> > -                     return ret;
> > -     }
> > -
> >       /*
> >        * We need to invalidate the secondary MMU mappings only when
> >        * there could be a permission downgrade on the ptes of the
> > @@ -1427,8 +1421,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >               raw_write_seqcount_end(&src_mm->write_protect_seq);
> >               mmu_notifier_invalidate_range_end(&range);
> >       }
> > -     if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
> > -             untrack_pfn_copy(dst_vma, pfn);
> >       return ret;
> >  }
> >
> > @@ -1923,9 +1915,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> >       if (vma->vm_file)
> >               uprobe_munmap(vma, start, end);
> >
> > -     if (unlikely(vma->vm_flags & VM_PFNMAP))
> > -             untrack_pfn(vma, 0, 0, mm_wr_locked);
> > -
> >       if (start != end) {
> >               if (unlikely(is_vm_hugetlb_page(vma))) {
> >                       /*
> > @@ -2871,6 +2860,36 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
> >       return error;
> >  }
> >
> > +#ifdef __HAVE_PFNMAP_TRACKING
> > +static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
> > +             unsigned long size, pgprot_t *prot)
> > +{
> > +     struct pfnmap_track_ctx *ctx;
> > +
> > +     if (pfnmap_track(pfn, size, prot))
> > +             return ERR_PTR(-EINVAL);
> > +
> > +     ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> > +     if (unlikely(!ctx)) {
> > +             pfnmap_untrack(pfn, size);
> > +             return ERR_PTR(-ENOMEM);
> > +     }
> > +
> > +     ctx->pfn = pfn;
> > +     ctx->size = size;
> > +     kref_init(&ctx->kref);
> > +     return ctx;
> > +}
> > +
> > +void pfnmap_track_ctx_release(struct kref *ref)
> > +{
> > +     struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref);
> > +
> > +     pfnmap_untrack(ctx->pfn, ctx->size);
> > +     kfree(ctx);
> > +}
> > +#endif /* __HAVE_PFNMAP_TRACKING */
> > +
> >  /**
> >   * remap_pfn_range - remap kernel memory to userspace
> >   * @vma: user vma to map to
> > @@ -2883,20 +2902,50 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
> >   *
> >   * Return: %0 on success, negative error code otherwise.
> >   */
> > +#ifdef __HAVE_PFNMAP_TRACKING
> >  int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> >                   unsigned long pfn, unsigned long size, pgprot_t prot)
> >  {
> > +     struct pfnmap_track_ctx *ctx = NULL;
> >       int err;
> >
> > -     err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
> > -     if (err)
> > +     size = PAGE_ALIGN(size);
> > +
> > +     /*
> > +      * If we cover the full VMA, we'll perform actual tracking, and
> > +      * remember to untrack when the last reference to our tracking
> > +      * context from a VMA goes away.
> > +      *
> > +      * If we only cover parts of the VMA, we'll only sanitize the
> > +      * pgprot.
> > +      */
> > +     if (addr == vma->vm_start && addr + size == vma->vm_end) {
> > +             if (vma->pfnmap_track_ctx)
> > +                     return -EINVAL;
> > +             ctx = pfnmap_track_ctx_alloc(pfn, size, &prot);
> > +             if (IS_ERR(ctx))
> > +                     return PTR_ERR(ctx);
> > +     } else if (pfnmap_sanitize_pgprot(pfn, size, &prot)) {
> >               return -EINVAL;
> > +     }
> >
> >       err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> > -     if (err)
> > -             untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
> > +     if (ctx) {
> > +             if (err)
> > +                     kref_put(&ctx->kref, pfnmap_track_ctx_release);
> > +             else
> > +                     vma->pfnmap_track_ctx = ctx;
> > +     }
> >       return err;
> >  }
> > +
> > +#else
> > +int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> > +                 unsigned long pfn, unsigned long size, pgprot_t prot)
> > +{
> > +     return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> > +}
> > +#endif
> >  EXPORT_SYMBOL(remap_pfn_range);
> >
> >  /**
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 7db9da609c84f..6e78e02f74bd3 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -1191,10 +1191,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
> >       if (is_vm_hugetlb_page(vma))
> >               clear_vma_resv_huge_pages(vma);
> >
> > -     /* Tell pfnmap has moved from this vma */
> > -     if (unlikely(vma->vm_flags & VM_PFNMAP))
> > -             untrack_pfn_clear(vma);
> > -
> >       *new_vma_ptr = new_vma;
> >       return err;
> >  }
> > --
> > 2.49.0
> >

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-25  8:17 ` [PATCH v1 05/11] mm: convert VM_PFNMAP tracking " David Hildenbrand
  2025-04-25 20:23   ` Peter Xu
  2025-04-28 19:38   ` Lorenzo Stoakes
@ 2025-04-28 20:10   ` Lorenzo Stoakes
  2025-05-05 13:00     ` David Hildenbrand
  2 siblings, 1 reply; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 20:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu,
	Suren Baghdasaryan

On Fri, Apr 25, 2025 at 10:17:09AM +0200, David Hildenbrand wrote:
> Let's use our new interface. In remap_pfn_range(), we'll now decide
> whether we have to track (full VMA covered) or only sanitize the pgprot
> (partial VMA covered).

Yeah I do agree with Peter that 'sanitize' is not great here, but naming is
hard :) anyway was discussed in that thread elsewhere...

>
> Remember what we have to untrack by linking it from the VMA. When
> duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
> to anon VMA names, and use a kref to share the tracking.

Yes this is sensible.

>
> Once the last VMA un-refs our tracking data, we'll do the untracking,
> which simplifies things a lot and should sort out the various issues we saw
> recently, for example, when partially unmapping/zapping a tracked VMA.

Lovely!

>
> This change implies that we'll keep tracking the original PFN range even
> after splitting + partially unmapping it: not too bad, because it was
> not working reliably before. The only thing that kind-of worked before
> was shrinking such a mapping using mremap(): we managed to adjust the
> reservation in a hacky way, now we won't adjust the reservation but
> leave it around until all involved VMAs are gone.

Hm, but what if we shrink a VMA and then map another one - might we then
incorrectly keep PAT attributes for part of the range that is now mapped
elsewhere?
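
(To spell the worry out in code - just a sketch, where fd is some
hypothetical pfnmap-backed device and pg the page size:)

#define _GNU_SOURCE
#include <sys/mman.h>

/* sketch only: fd is a pfnmap-backed device (hypothetical) */
static void sketch(int fd, long pg)
{
	void *p = mmap(NULL, 2 * pg, PROT_READ, MAP_SHARED, fd, 0);
					/* tracking covers 2 pages      */
	p = mremap(p, 2 * pg, pg, 0);	/* shrink: still covers 2 pages */
	/*
	 * If the tail page of the PFN range is now handed out elsewhere
	 * with a different cachemode, does the stale reservation conflict?
	 */
	(void)p;
}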

Also my god re: the 'kind of working' aspects of PAT, so frustrating.

>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Generally looking good, afaict, but maybe let's get some input from Suren
on VMA size.

Are there actually any PAT tests out there? I had a quick glance in
tools/testing/selftests/x86,mm and couldn't find any, but didn't look
_that_ hard.
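
(Even something this small would exercise the remap_pfn_range() path via
/dev/mem - a rough sketch; the offset is a placeholder and it needs root
plus CONFIG_STRICT_DEVMEM=n or an allowed range:)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/mem", O_RDONLY);
	void *p;

	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}
	/* placeholder offset: one page of the legacy BIOS area */
	p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0xf0000);
	if (p == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}
	munmap(p, 4096);
	close(fd);
	return EXIT_SUCCESS;
}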

Thanks in general for tackling this, this is a big improvement!

> ---
>  include/linux/mm_inline.h |  2 +
>  include/linux/mm_types.h  | 11 ++++++
>  kernel/fork.c             | 54 ++++++++++++++++++++++++--
>  mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
>  mm/mremap.c               |  4 --
>  5 files changed, 128 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index f9157a0c42a5c..89b518ff097e6 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>
>  #endif  /* CONFIG_ANON_VMA_NAME */
>
> +void pfnmap_track_ctx_release(struct kref *ref);
> +
>  static inline void init_tlb_flush_pending(struct mm_struct *mm)
>  {
>  	atomic_set(&mm->tlb_flush_pending, 0);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 56d07edd01f91..91124761cfda8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -764,6 +764,14 @@ struct vma_numab_state {
>  	int prev_scan_seq;
>  };
>
> +#ifdef __HAVE_PFNMAP_TRACKING
> +struct pfnmap_track_ctx {
> +	struct kref kref;
> +	unsigned long pfn;
> +	unsigned long size;

Again, (super) nitty, but we really should express units. I suppose 'size'
implies bytes, to be honest, as you'd be unlikely to say 'size' for a number
of pages (you'd go with nr_pages or something). But maybe a trailing
/* in bytes */ would help.

Not a big deal though!
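
(I.e. nothing more than the suggested comment on the struct from the patch:)

struct pfnmap_track_ctx {
	struct kref kref;
	unsigned long pfn;
	unsigned long size;	/* in bytes */
};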

> +};
> +#endif
> +
>  /*
>   * This struct describes a virtual memory area. There is one of these
>   * per VM-area/task. A VM area is any part of the process virtual memory
> @@ -877,6 +885,9 @@ struct vm_area_struct {
>  	struct anon_vma_name *anon_name;
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef __HAVE_PFNMAP_TRACKING

An aside, but I absolutely hate '__HAVE_PFNMAP_TRACKING' as a name here. But
you didn't create it, and it's not really sensible to change it in this
series. So, just a grumble...

> +	struct pfnmap_track_ctx *pfnmap_track_ctx;
> +#endif
>  } __randomize_layout;
>
>  #ifdef CONFIG_NUMA
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 168681fc4b25a..ae518b8fe752c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -481,7 +481,51 @@ static void vm_area_init_from(const struct vm_area_struct *src,
>  #ifdef CONFIG_NUMA
>  	dest->vm_policy = src->vm_policy;
>  #endif
> +#ifdef __HAVE_PFNMAP_TRACKING
> +	dest->pfnmap_track_ctx = NULL;
> +#endif
> +}
> +
> +#ifdef __HAVE_PFNMAP_TRACKING
> +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> +		struct vm_area_struct *new)
> +{
> +	struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx;
> +
> +	if (likely(!ctx))
> +		return 0;
> +
> +	/*
> +	 * We don't expect to ever hit this. If ever required, we would have
> +	 * to duplicate the tracking.
> +	 */
> +	if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX))
> +		return -ENOMEM;
> +	kref_get(&ctx->kref);
> +	new->pfnmap_track_ctx = ctx;
> +	return 0;
> +}
> +
> +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
> +{
> +	struct pfnmap_track_ctx *ctx = vma->pfnmap_track_ctx;
> +
> +	if (likely(!ctx))
> +		return;
> +
> +	kref_put(&ctx->kref, pfnmap_track_ctx_release);
> +	vma->pfnmap_track_ctx = NULL;
> +}
> +#else
> +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> +		struct vm_area_struct *new)
> +{
> +	return 0;
>  }
> +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
> +{
> +}
> +#endif
>
>  struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  {

Obviously my series will break this but should be _fairly_ trivial to
update.

You will however have to make sure to update tools/testing/vma/* to handle
the new functions in userland testing (they need to be stubbed out).

If it makes life easier, you can even send it to me off-list, or just send
it without changing this in a respin and I can fix it up fairly quick for
you.
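
(Concretely, the userland side should only need no-op stubs mirroring the
!__HAVE_PFNMAP_TRACKING variants from the patch - hypothetical placement
in tools/testing/vma/vma_internal.h:)

static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
		struct vm_area_struct *new)
{
	return 0;
}

static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
{
}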

> @@ -493,6 +537,11 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>  	ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
>  	ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
>  	vm_area_init_from(orig, new);
> +
> +	if (vma_pfnmap_track_ctx_dup(orig, new)) {
> +		kmem_cache_free(vm_area_cachep, new);
> +		return NULL;
> +	}
>  	vma_lock_init(new, true);
>  	INIT_LIST_HEAD(&new->anon_vma_chain);
>  	vma_numab_state_init(new);
> @@ -507,6 +556,7 @@ void vm_area_free(struct vm_area_struct *vma)
>  	vma_assert_detached(vma);
>  	vma_numab_state_free(vma);
>  	free_anon_vma_name(vma);
> +	vma_pfnmap_track_ctx_release(vma);
>  	kmem_cache_free(vm_area_cachep, vma);
>  }
>
> @@ -669,10 +719,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		if (!tmp)
>  			goto fail_nomem;
>
> -		/* track_pfn_copy() will later take care of copying internal state. */
> -		if (unlikely(tmp->vm_flags & VM_PFNMAP))
> -			untrack_pfn_clear(tmp);
> -
>  		retval = vma_dup_policy(mpnt, tmp);
>  		if (retval)
>  			goto fail_nomem_policy;
> diff --git a/mm/memory.c b/mm/memory.c
> index c737a8625866a..eb2b3f10a97ec 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1370,7 +1370,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  	struct mm_struct *dst_mm = dst_vma->vm_mm;
>  	struct mm_struct *src_mm = src_vma->vm_mm;
>  	struct mmu_notifier_range range;
> -	unsigned long next, pfn = 0;
> +	unsigned long next;
>  	bool is_cow;
>  	int ret;
>
> @@ -1380,12 +1380,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  	if (is_vm_hugetlb_page(src_vma))
>  		return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
>
> -	if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
> -		ret = track_pfn_copy(dst_vma, src_vma, &pfn);
> -		if (ret)
> -			return ret;
> -	}
> -

So lovely to see this kind of thing go...

>  	/*
>  	 * We need to invalidate the secondary MMU mappings only when
>  	 * there could be a permission downgrade on the ptes of the
> @@ -1427,8 +1421,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  		raw_write_seqcount_end(&src_mm->write_protect_seq);
>  		mmu_notifier_invalidate_range_end(&range);
>  	}
> -	if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
> -		untrack_pfn_copy(dst_vma, pfn);
>  	return ret;
>  }
>
> @@ -1923,9 +1915,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>  	if (vma->vm_file)
>  		uprobe_munmap(vma, start, end);
>
> -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> -		untrack_pfn(vma, 0, 0, mm_wr_locked);
> -
>  	if (start != end) {
>  		if (unlikely(is_vm_hugetlb_page(vma))) {
>  			/*
> @@ -2871,6 +2860,36 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
>  	return error;
>  }
>
> +#ifdef __HAVE_PFNMAP_TRACKING
> +static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
> +		unsigned long size, pgprot_t *prot)
> +{
> +	struct pfnmap_track_ctx *ctx;
> +
> +	if (pfnmap_track(pfn, size, prot))
> +		return ERR_PTR(-EINVAL);
> +
> +	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (unlikely(!ctx)) {
> +		pfnmap_untrack(pfn, size);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	ctx->pfn = pfn;
> +	ctx->size = size;
> +	kref_init(&ctx->kref);
> +	return ctx;
> +}
> +
> +void pfnmap_track_ctx_release(struct kref *ref)
> +{
> +	struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref);
> +
> +	pfnmap_untrack(ctx->pfn, ctx->size);
> +	kfree(ctx);
> +}
> +#endif /* __HAVE_PFNMAP_TRACKING */
> +
>  /**
>   * remap_pfn_range - remap kernel memory to userspace
>   * @vma: user vma to map to
> @@ -2883,20 +2902,50 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
>   *
>   * Return: %0 on success, negative error code otherwise.
>   */
> +#ifdef __HAVE_PFNMAP_TRACKING
>  int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>  		    unsigned long pfn, unsigned long size, pgprot_t prot)

OK so to expose some of my lack-of-knowledge of PAT - is this the
'entrypoint' to PAT tracking?

So we have some kernel memory we remap to userland as a PFN map, the kind
that it might very well be sensible to use PAT to change cache behaviour for,
and each time this happens, it's mapped as PAT?
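
(For anyone else following along, the classic caller is a driver ->mmap()
handler, roughly like this sketch - the "mydrv" names are made up:)

/* sketch of a typical caller; mydrv_base_phys() is hypothetical */
static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long pfn = mydrv_base_phys(file) >> PAGE_SHIFT;

	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	/* with PAT enabled, this is where the memtype gets reserved */
	return remap_pfn_range(vma, vma->vm_start, pfn,
			       vma->vm_end - vma->vm_start,
			       vma->vm_page_prot);
}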

>  {
> +	struct pfnmap_track_ctx *ctx = NULL;
>  	int err;
>
> -	err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
> -	if (err)
> +	size = PAGE_ALIGN(size);
> +
> +	/*
> +	 * If we cover the full VMA, we'll perform actual tracking, and
> +	 * remember to untrack when the last reference to our tracking
> +	 * context from a VMA goes away.
> +	 *
> +	 * If we only cover parts of the VMA, we'll only sanitize the
> +	 * pgprot.
> +	 */
> +	if (addr == vma->vm_start && addr + size == vma->vm_end) {
> +		if (vma->pfnmap_track_ctx)
> +			return -EINVAL;
> +		ctx = pfnmap_track_ctx_alloc(pfn, size, &prot);
> +		if (IS_ERR(ctx))
> +			return PTR_ERR(ctx);
> +	} else if (pfnmap_sanitize_pgprot(pfn, size, &prot)) {
>  		return -EINVAL;
> +	}
>
>  	err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> -	if (err)
> -		untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
> +	if (ctx) {
> +		if (err)
> +			kref_put(&ctx->kref, pfnmap_track_ctx_release);
> +		else
> +			vma->pfnmap_track_ctx = ctx;
> +	}
>  	return err;
>  }
> +
> +#else
> +int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> +		    unsigned long pfn, unsigned long size, pgprot_t prot)
> +{
> +	return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> +}
> +#endif
>  EXPORT_SYMBOL(remap_pfn_range);
>
>  /**
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 7db9da609c84f..6e78e02f74bd3 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -1191,10 +1191,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
>  	if (is_vm_hugetlb_page(vma))
>  		clear_vma_resv_huge_pages(vma);
>
> -	/* Tell pfnmap has moved from this vma */
> -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> -		untrack_pfn_clear(vma);
> -

Thanks! <3

>  	*new_vma_ptr = new_vma;
>  	return err;
>  }
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 06/11] x86/mm/pat: remove old pfnmap tracking interface
  2025-04-25  8:17 ` [PATCH v1 06/11] x86/mm/pat: remove old pfnmap tracking interface David Hildenbrand
@ 2025-04-28 20:12   ` Lorenzo Stoakes
  0 siblings, 0 replies; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 20:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu

On Fri, Apr 25, 2025 at 10:17:10AM +0200, David Hildenbrand wrote:
> We can now get rid of the old interface along with get_pat_info() and
> follow_phys().
>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Oh what a glorious glorious screen of red I see before me... deleted code
is the best code!

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  arch/x86/mm/pat/memtype.c | 147 --------------------------------------
>  include/linux/pgtable.h   |  66 -----------------
>  2 files changed, 213 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index c011d8dd8f441..668ebf0065157 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -933,119 +933,6 @@ static void free_pfn_range(u64 paddr, unsigned long size)
>  		memtype_free(paddr, paddr + size);
>  }
>
> -static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
> -		resource_size_t *phys)
> -{
> -	struct follow_pfnmap_args args = { .vma = vma, .address = vma->vm_start };
> -
> -	if (follow_pfnmap_start(&args))
> -		return -EINVAL;
> -
> -	/* Never return PFNs of anon folios in COW mappings. */
> -	if (!args.special) {
> -		follow_pfnmap_end(&args);
> -		return -EINVAL;
> -	}
> -
> -	*prot = pgprot_val(args.pgprot);
> -	*phys = (resource_size_t)args.pfn << PAGE_SHIFT;
> -	follow_pfnmap_end(&args);
> -	return 0;
> -}
> -
> -static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
> -		pgprot_t *pgprot)
> -{
> -	unsigned long prot;
> -
> -	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_PAT));
> -
> -	/*
> -	 * We need the starting PFN and cachemode used for track_pfn_remap()
> -	 * that covered the whole VMA. For most mappings, we can obtain that
> -	 * information from the page tables. For COW mappings, we might now
> -	 * suddenly have anon folios mapped and follow_phys() will fail.
> -	 *
> -	 * Fallback to using vma->vm_pgoff, see remap_pfn_range_notrack(), to
> -	 * detect the PFN. If we need the cachemode as well, we're out of luck
> -	 * for now and have to fail fork().
> -	 */
> -	if (!follow_phys(vma, &prot, paddr)) {
> -		if (pgprot)
> -			*pgprot = __pgprot(prot);
> -		return 0;
> -	}
> -	if (is_cow_mapping(vma->vm_flags)) {
> -		if (pgprot)
> -			return -EINVAL;
> -		*paddr = (resource_size_t)vma->vm_pgoff << PAGE_SHIFT;
> -		return 0;
> -	}
> -	WARN_ON_ONCE(1);
> -	return -EINVAL;
> -}
> -
> -int track_pfn_copy(struct vm_area_struct *dst_vma,
> -		struct vm_area_struct *src_vma, unsigned long *pfn)
> -{
> -	const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
> -	resource_size_t paddr;
> -	pgprot_t pgprot;
> -	int rc;
> -
> -	if (!(src_vma->vm_flags & VM_PAT))
> -		return 0;
> -
> -	/*
> -	 * Duplicate the PAT information for the dst VMA based on the src
> -	 * VMA.
> -	 */
> -	if (get_pat_info(src_vma, &paddr, &pgprot))
> -		return -EINVAL;
> -	rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
> -	if (rc)
> -		return rc;
> -
> -	/* Reservation for the destination VMA succeeded. */
> -	vm_flags_set(dst_vma, VM_PAT);
> -	*pfn = PHYS_PFN(paddr);
> -	return 0;
> -}
> -
> -void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn)
> -{
> -	untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
> -	/*
> -	 * Reservation was freed, any copied page tables will get cleaned
> -	 * up later, but without getting PAT involved again.
> -	 */
> -}
> -
> -/*
> - * prot is passed in as a parameter for the new mapping. If the vma has
> - * a linear pfn mapping for the entire range, or no vma is provided,
> - * reserve the entire pfn + size range with single reserve_pfn_range
> - * call.
> - */
> -int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> -		    unsigned long pfn, unsigned long addr, unsigned long size)
> -{
> -	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> -
> -	/* reserve the whole chunk starting from paddr */
> -	if (!vma || (addr == vma->vm_start
> -				&& size == (vma->vm_end - vma->vm_start))) {
> -		int ret;
> -
> -		ret = reserve_pfn_range(paddr, size, prot, 0);
> -		if (ret == 0 && vma)
> -			vm_flags_set(vma, VM_PAT);
> -		return ret;
> -	}
> -
> -	return pfnmap_sanitize_pgprot(pfn, size, prot);
> -}
> -
>  int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size, pgprot_t *prot)
>  {
>  	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> @@ -1082,40 +969,6 @@ void pfnmap_untrack(unsigned long pfn, unsigned long size)
>  	free_pfn_range(paddr, size);
>  }
>
> -/*
> - * untrack_pfn is called while unmapping a pfnmap for a region.
> - * untrack can be called for a specific region indicated by pfn and size or
> - * can be for the entire vma (in which case pfn, size are zero).
> - */
> -void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> -		 unsigned long size, bool mm_wr_locked)
> -{
> -	resource_size_t paddr;
> -
> -	if (vma && !(vma->vm_flags & VM_PAT))
> -		return;
> -
> -	/* free the chunk starting from pfn or the whole chunk */
> -	paddr = (resource_size_t)pfn << PAGE_SHIFT;
> -	if (!paddr && !size) {
> -		if (get_pat_info(vma, &paddr, NULL))
> -			return;
> -		size = vma->vm_end - vma->vm_start;
> -	}
> -	free_pfn_range(paddr, size);
> -	if (vma) {
> -		if (mm_wr_locked)
> -			vm_flags_clear(vma, VM_PAT);
> -		else
> -			__vm_flags_mod(vma, 0, VM_PAT);
> -	}
> -}
> -
> -void untrack_pfn_clear(struct vm_area_struct *vma)
> -{
> -	vm_flags_clear(vma, VM_PAT);
> -}
> -
>  pgprot_t pgprot_writecombine(pgprot_t prot)
>  {
>  	pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WC);
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 898a3ab195578..0ffc6b9339182 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1489,17 +1489,6 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
>   * vmf_insert_pfn.
>   */
>
> -/*
> - * track_pfn_remap is called when a _new_ pfn mapping is being established
> - * by remap_pfn_range() for physical range indicated by pfn and size.
> - */
> -static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> -				  unsigned long pfn, unsigned long addr,
> -				  unsigned long size)
> -{
> -	return 0;
> -}
> -
>  static inline int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
>  		pgprot_t *prot)
>  {
> @@ -1515,55 +1504,7 @@ static inline int pfnmap_track(unsigned long pfn, unsigned long size,
>  static inline void pfnmap_untrack(unsigned long pfn, unsigned long size)
>  {
>  }
> -
> -/*
> - * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
> - * tables copied during copy_page_range(). Will store the pfn to be
> - * passed to untrack_pfn_copy() only if there is something to be untracked.
> - * Callers should initialize the pfn to 0.
> - */
> -static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
> -		struct vm_area_struct *src_vma, unsigned long *pfn)
> -{
> -	return 0;
> -}
> -
> -/*
> - * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
> - * copy_page_range(), but after track_pfn_copy() was already called. Can
> - * be called even if track_pfn_copy() did not actually track anything:
> - * handled internally.
> - */
> -static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> -		unsigned long pfn)
> -{
> -}
> -
> -/*
> - * untrack_pfn is called while unmapping a pfnmap for a region.
> - * untrack can be called for a specific region indicated by pfn and size or
> - * can be for the entire vma (in which case pfn, size are zero).
> - */
> -static inline void untrack_pfn(struct vm_area_struct *vma,
> -			       unsigned long pfn, unsigned long size,
> -			       bool mm_wr_locked)
> -{
> -}
> -
> -/*
> - * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
> - *
> - * 1) During mremap() on the src VMA after the page tables were moved.
> - * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
> - */
> -static inline void untrack_pfn_clear(struct vm_area_struct *vma)
> -{
> -}
>  #else
> -extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> -			   unsigned long pfn, unsigned long addr,
> -			   unsigned long size);
> -
>  /**
>   * pfnmap_sanitize_pgprot - sanitize the pgprot for a pfn range
>   * @pfn: the start of the pfn range
> @@ -1603,13 +1544,6 @@ int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot);
>   * un-doing any reservation.
>   */
>  void pfnmap_untrack(unsigned long pfn, unsigned long size);
> -extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> -		struct vm_area_struct *src_vma, unsigned long *pfn);
> -extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> -		unsigned long pfn);
> -extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> -			unsigned long size, bool mm_wr_locked);
> -extern void untrack_pfn_clear(struct vm_area_struct *vma);
>  #endif
>
>  #ifdef CONFIG_MMU
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 07/11] mm: remove VM_PAT
  2025-04-25  8:17 ` [PATCH v1 07/11] mm: remove VM_PAT David Hildenbrand
@ 2025-04-28 20:16   ` Lorenzo Stoakes
  0 siblings, 0 replies; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 20:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu

On Fri, Apr 25, 2025 at 10:17:11AM +0200, David Hildenbrand wrote:
> It's unused, so let's remove it.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>

I have only <3 for this patch :) byyyyeee VM_PAT!

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  include/linux/mm.h             | 4 +---
>  include/trace/events/mmflags.h | 4 +---
>  2 files changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9b701cfbef223..a205020e2a58b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -357,9 +357,7 @@ extern unsigned int kobjsize(const void *objp);
>  # define VM_SHADOW_STACK	VM_NONE
>  #endif
>
> -#if defined(CONFIG_X86)
> -# define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
> -#elif defined(CONFIG_PPC64)
> +#if defined(CONFIG_PPC64)
>  # define VM_SAO		VM_ARCH_1	/* Strong Access Ordering (powerpc) */
>  #elif defined(CONFIG_PARISC)
>  # define VM_GROWSUP	VM_ARCH_1
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 15aae955a10bf..aa441f593e9a6 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -172,9 +172,7 @@ IF_HAVE_PG_ARCH_3(arch_3)
>  	__def_pageflag_names						\
>  	) : "none"
>
> -#if defined(CONFIG_X86)
> -#define __VM_ARCH_SPECIFIC_1 {VM_PAT,     "pat"           }
> -#elif defined(CONFIG_PPC64)
> +#if defined(CONFIG_PPC64)
>  #define __VM_ARCH_SPECIFIC_1 {VM_SAO,     "sao"           }
>  #elif defined(CONFIG_PARISC)
>  #define __VM_ARCH_SPECIFIC_1 {VM_GROWSUP,	"growsup"	}
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 08/11] x86/mm/pat: remove strict_prot parameter from reserve_pfn_range()
  2025-04-25  8:17 ` [PATCH v1 08/11] x86/mm/pat: remove strict_prot parameter from reserve_pfn_range() David Hildenbrand
@ 2025-04-28 20:18   ` Lorenzo Stoakes
  0 siblings, 0 replies; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 20:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu

On Fri, Apr 25, 2025 at 10:17:12AM +0200, David Hildenbrand wrote:
> Always set to 0, so let's remove it.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Ah yes here is where you remove it :)

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  arch/x86/mm/pat/memtype.c | 12 +++---------
>  1 file changed, 3 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index 668ebf0065157..57e3ced4c28cb 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -858,8 +858,7 @@ int memtype_kernel_map_sync(u64 base, unsigned long size,
>   * Reserved non RAM regions only and after successful memtype_reserve,
>   * this func also keeps identity mapping (if any) in sync with this new prot.
>   */
> -static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
> -				int strict_prot)
> +static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot)
>  {
>  	int is_ram = 0;
>  	int ret;
> @@ -895,8 +894,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
>  		return ret;
>
>  	if (pcm != want_pcm) {
> -		if (strict_prot ||
> -		    !is_new_memtype_allowed(paddr, size, want_pcm, pcm)) {
> +		if (!is_new_memtype_allowed(paddr, size, want_pcm, pcm)) {
>  			memtype_free(paddr, paddr + size);
>  			pr_err("x86/PAT: %s:%d map pfn expected mapping type %s for [mem %#010Lx-%#010Lx], got %s\n",
>  			       current->comm, current->pid,
> @@ -906,10 +904,6 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
>  			       cattr_name(pcm));
>  			return -EINVAL;
>  		}
> -		/*
> -		 * We allow returning different type than the one requested in
> -		 * non strict case.
> -		 */
>  		pgprot_set_cachemode(vma_prot, pcm);
>  	}
>
> @@ -959,7 +953,7 @@ int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot)
>  {
>  	const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
>
> -	return reserve_pfn_range(paddr, size, prot, 0);
> +	return reserve_pfn_range(paddr, size, prot);
>  }
>
>  void pfnmap_untrack(unsigned long pfn, unsigned long size)
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-28 19:37               ` Lorenzo Stoakes
  2025-04-28 19:57                 ` Suren Baghdasaryan
@ 2025-04-28 20:19                 ` David Hildenbrand
  1 sibling, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2025-04-28 20:19 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Peter Xu, linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato

On 28.04.25 21:37, Lorenzo Stoakes wrote:
> On Mon, Apr 28, 2025 at 07:23:18PM +0200, David Hildenbrand wrote:
>> On 28.04.25 18:24, Peter Xu wrote:
>>> On Mon, Apr 28, 2025 at 06:16:21PM +0200, David Hildenbrand wrote:
>>>>> Probably due to what config you have.  E.g., when I'm looking at mine it's
>>>>> much bigger and already consuming 256B, but it's because I enabled more
>>>>> things (userfaultfd, lockdep, etc.).
>>>>
>>>> Note that I enabled everything that you would expect on a production system
>>>> (incl. userfaultfd, mempolicy, per-vma locks), so I didn't enable lockdep.
>>>
>>> I still doubt whether you at least enabled userfaultfd, e.g., your previous
>>> paste has:
>>>
>>>     struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
>>>
>>> Not something that matters.. but just in case you didn't use the expected
>>> config file you wanted to use..
>>
>> You're absolutely right. I only briefly rechecked for this purpose here on
>> my notebook, and only looked for the existence of members, not expecting
>> that we have confusing stuff like vm_userfaultfd_ctx.
>>
>> I checked again and the size stays at 192 with allyesconfig and then
>> disabling debug options.
> 
> I think a reasonable case is everything on, except CONFIG_DEBUG_LOCK_ALLOC and I
> don't care about nommu.
> 
> So:
> 
> CONFIG_PER_VMA_LOCK
> CONFIG_SWAP
> CONFIG_MMU (exclude the nommu vm_region field)
> CONFIG_NUMA
> CONFIG_NUMA_BALANCING
> CONFIG_ANON_VMA_NAME
> __HAVE_PFNMAP_TRACKING

Yes.

And our ugly friend CONFIG_USERFAULTFD

that is

struct vm_userfaultfd_ctx {
	struct userfaultfd_ctx *ctx;
};
#else /* CONFIG_USERFAULTFD */
#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) {})
struct vm_userfaultfd_ctx {};
#endif /* CONFIG_USERFAULTFD */

(yes, you made the same mistake as I made when skimming whether everything
relevant was enabled)
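
(The confusing part being that the member is always present - it just
occupies 0 bytes with CONFIG_USERFAULTFD=n, thanks to the GNU C empty
struct extension. Something like this would hold either way:)

/* 0 bytes without CONFIG_USERFAULTFD, one pointer with it */
_Static_assert(sizeof(struct vm_userfaultfd_ctx) == 0 ||
	       sizeof(struct vm_userfaultfd_ctx) ==
	       sizeof(struct userfaultfd_ctx *),
	       "unexpected vm_userfaultfd_ctx size");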

> 
> So to be clear - allyesconfig w/o debug gives us this yes? And we don't add a
> cache line? In which case all good :)

Looks like it!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-28 20:00     ` Suren Baghdasaryan
@ 2025-04-28 20:21       ` David Hildenbrand
  0 siblings, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2025-04-28 20:21 UTC (permalink / raw)
  To: Suren Baghdasaryan, Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu

On 28.04.25 22:00, Suren Baghdasaryan wrote:
> On Mon, Apr 28, 2025 at 12:47 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
>>
>> +cc Suren, who has worked HEAVILY on VMA field manipulation and such :)
>>
>> Suren - David is proposing adding a new field. AFAICT this does not add a
>> new cache line so I think we're all good.
>>
>> But FYI!
> 
> Thanks! Yes, there should be some space in the last cacheline after my
> last field reshuffling.

That explains why -- the last time I looked at this -- there was no
easy space available. Thanks for that!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 09/11] x86/mm/pat: remove MEMTYPE_*_MATCH
  2025-04-25  8:17 ` [PATCH v1 09/11] x86/mm/pat: remove MEMTYPE_*_MATCH David Hildenbrand
@ 2025-04-28 20:23   ` Lorenzo Stoakes
  2025-05-05 12:10     ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 20:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu

On Fri, Apr 25, 2025 at 10:17:13AM +0200, David Hildenbrand wrote:
> The "memramp() shrinking" scenario no longer applies, so let's remove
> that now-unnecessary handling.

I wonder if we could remove even more of the code given these
simplifications? But not a big deal.

>
> Signed-off-by: David Hildenbrand <david@redhat.com>

More lovely removal...

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  arch/x86/mm/pat/memtype_interval.c | 44 ++++--------------------------
>  1 file changed, 6 insertions(+), 38 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype_interval.c b/arch/x86/mm/pat/memtype_interval.c
> index 645613d59942a..9d03f0dbc4715 100644
> --- a/arch/x86/mm/pat/memtype_interval.c
> +++ b/arch/x86/mm/pat/memtype_interval.c
> @@ -49,26 +49,15 @@ INTERVAL_TREE_DEFINE(struct memtype, rb, u64, subtree_max_end,
>
>  static struct rb_root_cached memtype_rbroot = RB_ROOT_CACHED;
>
> -enum {
> -	MEMTYPE_EXACT_MATCH	= 0,
> -	MEMTYPE_END_MATCH	= 1
> -};
> -
> -static struct memtype *memtype_match(u64 start, u64 end, int match_type)
> +static struct memtype *memtype_match(u64 start, u64 end)
>  {
>  	struct memtype *entry_match;
>
>  	entry_match = interval_iter_first(&memtype_rbroot, start, end-1);
>
>  	while (entry_match != NULL && entry_match->start < end) {
> -		if ((match_type == MEMTYPE_EXACT_MATCH) &&
> -		    (entry_match->start == start) && (entry_match->end == end))
> -			return entry_match;
> -
> -		if ((match_type == MEMTYPE_END_MATCH) &&
> -		    (entry_match->start < start) && (entry_match->end == end))
> +		if (entry_match->start == start && entry_match->end == end)
>  			return entry_match;
> -
>  		entry_match = interval_iter_next(entry_match, start, end-1);
>  	}
>
> @@ -132,32 +121,11 @@ struct memtype *memtype_erase(u64 start, u64 end)
>  {
>  	struct memtype *entry_old;
>
> -	/*
> -	 * Since the memtype_rbroot tree allows overlapping ranges,
> -	 * memtype_erase() checks with EXACT_MATCH first, i.e. free
> -	 * a whole node for the munmap case.  If no such entry is found,
> -	 * it then checks with END_MATCH, i.e. shrink the size of a node
> -	 * from the end for the mremap case.
> -	 */
> -	entry_old = memtype_match(start, end, MEMTYPE_EXACT_MATCH);
> -	if (!entry_old) {
> -		entry_old = memtype_match(start, end, MEMTYPE_END_MATCH);
> -		if (!entry_old)
> -			return ERR_PTR(-EINVAL);
> -	}
> -
> -	if (entry_old->start == start) {
> -		/* munmap: erase this node */
> -		interval_remove(entry_old, &memtype_rbroot);
> -	} else {
> -		/* mremap: update the end value of this node */
> -		interval_remove(entry_old, &memtype_rbroot);
> -		entry_old->end = start;
> -		interval_insert(entry_old, &memtype_rbroot);
> -
> -		return NULL;
> -	}
> +	entry_old = memtype_match(start, end);
> +	if (!entry_old)
> +		return ERR_PTR(-EINVAL);
>
> +	interval_remove(entry_old, &memtype_rbroot);
>  	return entry_old;
>  }
>
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 10/11] drm/i915: track_pfn() -> "pfnmap tracking"
  2025-04-25  8:17 ` [PATCH v1 10/11] drm/i915: track_pfn() -> "pfnmap tracking" David Hildenbrand
@ 2025-04-28 20:23   ` Lorenzo Stoakes
  0 siblings, 0 replies; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-04-28 20:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu

On Fri, Apr 25, 2025 at 10:17:14AM +0200, David Hildenbrand wrote:
> track_pfn() does not exist, so let's simply refer to it as "pfnmap
> tracking".
>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  drivers/gpu/drm/i915/i915_mm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_mm.c b/drivers/gpu/drm/i915/i915_mm.c
> index 76e2801619f09..c33bd3d830699 100644
> --- a/drivers/gpu/drm/i915/i915_mm.c
> +++ b/drivers/gpu/drm/i915/i915_mm.c
> @@ -100,7 +100,7 @@ int remap_io_mapping(struct vm_area_struct *vma,
>
>  	GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS);
>
> -	/* We rely on prevalidation of the io-mapping to skip track_pfn(). */
> +	/* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
>  	r.mm = vma->vm_mm;
>  	r.pfn = pfn;
>  	r.prot = __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) |
> @@ -140,7 +140,7 @@ int remap_io_sg(struct vm_area_struct *vma,
>  	};
>  	int err;
>
> -	/* We rely on prevalidation of the io-mapping to skip track_pfn(). */
> +	/* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
>  	GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS);
>
>  	while (offset >= r.sgt.max >> PAGE_SHIFT) {
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-28 19:57                 ` Suren Baghdasaryan
@ 2025-04-28 20:23                   ` David Hildenbrand
  0 siblings, 0 replies; 59+ messages in thread
From: David Hildenbrand @ 2025-04-28 20:23 UTC (permalink / raw)
  To: Suren Baghdasaryan, Lorenzo Stoakes
  Cc: Peter Xu, linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato

On 28.04.25 21:57, Suren Baghdasaryan wrote:
> On Mon, Apr 28, 2025 at 12:37 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
>>
>> On Mon, Apr 28, 2025 at 07:23:18PM +0200, David Hildenbrand wrote:
>>> On 28.04.25 18:24, Peter Xu wrote:
>>>> On Mon, Apr 28, 2025 at 06:16:21PM +0200, David Hildenbrand wrote:
>>>>>> Probably due to what config you have.  E.g., when I'm looking at mine it's
>>>>>> much bigger and already consuming 256B, but it's because I enabled more
>>>>>> things (userfaultfd, lockdep, etc.).
>>>>>
>>>>> Note that I enabled everything that you would expect on a production system
>>>>> (incl. userfaultfd, mempolicy, per-vma locks), so I didn't enable lockdep.
>>>>
>>>> I still doubt whether you at least enabled userfaultfd, e.g., your previous
>>>> paste has:
>>>>
>>>>     struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     0 */
>>>>
>>>> Not something that matters.. but just in case you didn't use the expected
>>>> config file you wanted to use..
>>>
>>> You're absolutely right. I only briefly rechecked for this purpose here on
>>> my notebook, and only looked for the existence of members, not expecting
>>> that we have confusing stuff like vm_userfaultfd_ctx.
>>>
>>> I checked again and the size stays at 192 with allyesconfig and then
>>> disabling debug options.
>>
>> I think a reasonable case is everything on, except CONFIG_DEBUG_LOCK_ALLOC and I
>> don't care about nommu.
> 
> I think it's safe to assume that production systems would disable
> lockdep due to the performance overhead. At least that's what we do on
> Android - enable it on development branches but disable in production.

Right, and "struct lockdep_map" is ... significantly larger than 8 
bytes. With that enabled, one is already paying for extra VMA space ...

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
  2025-04-28 16:21           ` Peter Xu
@ 2025-04-28 20:37             ` David Hildenbrand
  2025-04-29 13:44               ` Peter Xu
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-28 20:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On 28.04.25 18:21, Peter Xu wrote:
> On Mon, Apr 28, 2025 at 04:58:46PM +0200, David Hildenbrand wrote:
>>
>>>> What it does on PAT (only implementation so far ...) is looking up the
>>>> memory type to select the caching mode that can be used.
>>>>
>>>> "sanitize" was IMHO a good fit, because we must make sure that we don't use
>>>> the wrong caching mode.
>>>>
>>>> update/setup/... don't make that quite clear. Any other suggestions?
>>>
>>> I'm very poor on naming.. :( So far anything seems slightly better than
>>> sanitize to me, as the word "sanitize" is actually also used in memtype.c
>>> for other purpose.. see sanitize_phys().
>>
>> Sure, one can sanitize a lot of things. Here it's the cachemode/pgprot, in
>> the other functions it's an address.
>>
>> Likely we should just call it pfnmap_X_cachemode().
>>
>> Set/update don't really fit for X in case pfnmap_X_cachemode() is a NOP.
>>
>> pfnmap_setup_cachemode() ? Hm.
> 
> Sounds good here.

Okay, I'll use that one. If something else besides PAT ever requires
different semantics, they can bother with finding a better name :)
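
For the archives, the result would look something like this (a sketch only;
the prototype mirrors the pfnmap_sanitize_pgprot() one quoted below, and the
single-page wrapper is the never-failing one Peter suggested):

int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size,
		pgprot_t *prot);

static inline void pfnmap_setup_cachemode_pfn(unsigned long pfn,
		pgprot_t *prot)
{
	pfnmap_setup_cachemode(pfn, PAGE_SIZE, prot);
}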

> 
>>
>>>
>>>>
>>>>>
>>>>>> + * @pfn: the start of the pfn range
>>>>>> + * @size: the size of the pfn range
>>>>>> + * @prot: the pgprot to sanitize
>>>>>> + *
>>>>>> + * Sanitize the given pgprot for a pfn range, for example, adjusting the
>>>>>> + * cachemode.
>>>>>> + *
>>>>>> + * This function cannot fail for a single page, but can fail for multiple
>>>>>> + * pages.
>>>>>> + *
>>>>>> + * Returns 0 on success and -EINVAL on error.
>>>>>> + */
>>>>>> +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
>>>>>> +		pgprot_t *prot);
>>>>>>     extern int track_pfn_copy(struct vm_area_struct *dst_vma,
>>>>>>     		struct vm_area_struct *src_vma, unsigned long *pfn);
>>>>>>     extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index fdcf0a6049b9f..b8ae5e1493315 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>>>>>>     			return VM_FAULT_OOM;
>>>>>>     	}
>>>>>> -	track_pfn_insert(vma, &pgprot, pfn);
>>>>>> +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
>>>>>> +		return VM_FAULT_FALLBACK;
>>>>>
>>>>> Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
>>>>> trigger, though.
>>>>>
>>>>> Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
>>>>> replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
>>>>> a win already in all cases.
>>>>
>>>> It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. That's
>>>> certainly helpful for the single-page case.
>>>>
>>>> Regarding never failing here: we should check the whole range. We have to
>>>> make sure that none of the pages has a memory type / caching mode that is
>>>> incompatible with what we setup.
>>>
>>> Would it happen in the real world?
>>>
>>> IIUC per-vma registration needs to happen first, which checks for memtype
>>> conflicts in the first place, or reserve_pfn_range() could already have
>>> failed.
>>>
>>> Here it's the fault path looking up the memtype, so I would expect it is
>>> guaranteed all pfns under the same vma are following the verified (and
>>> same) memtype?
>>
>> The whole point of track_pfn_insert() is that it is used when we *don't* use
>> reserve_pfn_range()->track_pfn_remap(), no?
>>
>> track_pfn_remap() would check the whole range that gets mapped, so
>> track_pfn_insert() user must similarly check the whole range that gets
>> mapped.
>>
>> Note that even track_pfn_insert() is already pretty clear on the intended
>> usage: "called when a _new_ single pfn is established"
> 
> We need to define "new" then..  But I agree it's not crystal clear at
> least.  I think I just wasn't the first to assume it was reserved, see this
> (especially, the "Expectation" part..):
> 
> commit 5180da410db6369d1f95c9014da1c9bc33fb043e
> Author: Suresh Siddha <suresh.b.siddha@intel.com>
> Date:   Mon Oct 8 16:28:29 2012 -0700
> 
>      x86, pat: separate the pfn attribute tracking for remap_pfn_range and vm_insert_pfn
>      
>      With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
>      attribute and uses it.  Expectation is that the driver reserves the
>      memory attributes for the pfn before calling vm_insert_pfn().

It's all confusing.

We do have the following functions relevant in pat code:

(1) memtype_reserve(): used by ioremap and set_memory_XX

(2) memtype_reserve_io(): used by iomap

(3) reserve_pfn_range(): only remap_pfn_range() calls it

(4) arch_io_reserve_memtype_wc()


Which one would perform the reservation for, say, vfio?


I agree that if there were a guarantee/expectation that all PFNs 
have the same memtype (from a previous reservation), it would be 
sufficient to check a single PFN, and we could document that. I just 
don't easily see where that reservation is happening.

So a pointer to that would be appreciated!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
  2025-04-28 20:37             ` David Hildenbrand
@ 2025-04-29 13:44               ` Peter Xu
  2025-04-29 16:25                 ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Xu @ 2025-04-29 13:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On Mon, Apr 28, 2025 at 10:37:49PM +0200, David Hildenbrand wrote:
> On 28.04.25 18:21, Peter Xu wrote:
> > On Mon, Apr 28, 2025 at 04:58:46PM +0200, David Hildenbrand wrote:
> > > 
> > > > > What it does on PAT (only implementation so far ...) is looking up the
> > > > > memory type to select the caching mode that can be used.
> > > > > 
> > > > > "sanitize" was IMHO a good fit, because we must make sure that we don't use
> > > > > the wrong caching mode.
> > > > > 
> > > > > update/setup/... don't make that quite clear. Any other suggestions?
> > > > 
> > > > I'm very poor on naming.. :( So far anything seems slightly better than
> > > > sanitize to me, as the word "sanitize" is actually also used in memtype.c
> > > > for other purpose.. see sanitize_phys().
> > > 
> > > Sure, one can sanitize a lot of things. Here it's the cachemode/pgprot, in
> > > the other functions it's an address.
> > >
> > > Likely we should just call it pfnmap_X_cachemode().
> > > 
> > > Set/update don't really fit for X in case pfnmap_X_cachemode() is a NOP.
> > > 
> > > pfnmap_setup_cachemode() ? Hm.
> > 
> > Sounds good here.
> 
> Okay, I'll use that one. If ever something else besides PAT would require
> different semantics, they can bother with finding a better name :)
> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > + * @pfn: the start of the pfn range
> > > > > > > + * @size: the size of the pfn range
> > > > > > > + * @prot: the pgprot to sanitize
> > > > > > > + *
> > > > > > > + * Sanitize the given pgprot for a pfn range, for example, adjusting the
> > > > > > > + * cachemode.
> > > > > > > + *
> > > > > > > + * This function cannot fail for a single page, but can fail for multiple
> > > > > > > + * pages.
> > > > > > > + *
> > > > > > > + * Returns 0 on success and -EINVAL on error.
> > > > > > > + */
> > > > > > > +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
> > > > > > > +		pgprot_t *prot);
> > > > > > >     extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> > > > > > >     		struct vm_area_struct *src_vma, unsigned long *pfn);
> > > > > > >     extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > > > index fdcf0a6049b9f..b8ae5e1493315 100644
> > > > > > > --- a/mm/huge_memory.c
> > > > > > > +++ b/mm/huge_memory.c
> > > > > > > @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> > > > > > >     			return VM_FAULT_OOM;
> > > > > > >     	}
> > > > > > > -	track_pfn_insert(vma, &pgprot, pfn);
> > > > > > > +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
> > > > > > > +		return VM_FAULT_FALLBACK;
> > > > > > 
> > > > > > Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
> > > > > > trigger, though.
> > > > > > 
> > > > > > Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
> > > > > > replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
> > > > > > a win already in all cases.
> > > > > 
> > > > > It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. That's
> > > > > certainly helpful for the single-page case.
> > > > > 
> > > > > Regarding never failing here: we should check the whole range. We have to
> > > > > make sure that none of the pages has a memory type / caching mode that is
> > > > > incompatible with what we setup.
> > > > 
> > > > Would it happen in real world?
> > > >
> > > > IIUC per-vma registration needs to happen first, which checks for memtype
> > > > conflicts in the first place, or reserve_pfn_range() could already have
> > > > failed.
> > > >
> > > > Here it's the fault path looking up the memtype, so I would expect it is
> > > > guaranteed all pfns under the same vma are following the verified (and same)
> > > > memtype?
> > > 
> > > The whole point of track_pfn_insert() is that it is used when we *don't* use
> > > reserve_pfn_range()->track_pfn_remap(), no?
> > > 
> > > track_pfn_remap() would check the whole range that gets mapped, so
> > > track_pfn_insert() user must similarly check the whole range that gets
> > > mapped.
> > > 
> > > Note that even track_pfn_insert() is already pretty clear on the intended
> > > usage: "called when a _new_ single pfn is established"
> > 
> > We need to define "new" then..  But I agree it's not crystal clear at
> > least.  I think I just wasn't the first to assume it was reserved, see this
> > (especially, the "Expectation" part..):
> > 
> > commit 5180da410db6369d1f95c9014da1c9bc33fb043e
> > Author: Suresh Siddha <suresh.b.siddha@intel.com>
> > Date:   Mon Oct 8 16:28:29 2012 -0700
> > 
> >      x86, pat: separate the pfn attribute tracking for remap_pfn_range and vm_insert_pfn
> >
> >      With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
> >      attribute and uses it.  Expectation is that the driver reserves the
> >      memory attributes for the pfn before calling vm_insert_pfn().
> 
> It's all confusing.
> 
> We do have the following functions relevant in pat code:
> 
> (1) memtype_reserve(): used by ioremap and set_memory_XX
> 
> (2) memtype_reserve_io(): used by iomap
> 
> (3) reserve_pfn_range(): only remap_pfn_range() calls it
> 
> (4) arch_io_reserve_memtype_wc()
> 
> 
> Which one would perform the reservation for, say, vfio?

My understanding is it was done via barmap.  See this stack:

vfio_pci_core_mmap
  pci_iomap
    pci_iomap_range
      ... 
        __ioremap_caller
          memtype_reserve

> 
> 
> > I agree that if there were a guarantee/expectation that all PFNs have
> > the same memtype (from a previous reservation), it would be sufficient to
> > check a single PFN, and we could document that. I just don't easily see
> > where that reservation is happening.
> 
> So a pointer to that would be appreciated!

I am not aware of any pointer.. maybe others could chime in.

IMHO, if there's anything uncertain, for this one we could always decouple
this issue from the core issue you're working on, so at least it keeps the
old behavior (which is pure lookup on pfn injections) until a solid issue
occurs?  It avoids the case where we could introduce unnecessary code but
then it's much harder to justify a removal.  What do you think?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
  2025-04-29 13:44               ` Peter Xu
@ 2025-04-29 16:25                 ` David Hildenbrand
  2025-04-29 16:36                   ` Peter Xu
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-04-29 16:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On 29.04.25 15:44, Peter Xu wrote:
> On Mon, Apr 28, 2025 at 10:37:49PM +0200, David Hildenbrand wrote:
>> On 28.04.25 18:21, Peter Xu wrote:
>>> On Mon, Apr 28, 2025 at 04:58:46PM +0200, David Hildenbrand wrote:
>>>>
>>>>>> What it does on PAT (only implementation so far ...) is looking up the
>>>>>> memory type to select the caching mode that can be used.
>>>>>>
>>>>>> "sanitize" was IMHO a good fit, because we must make sure that we don't use
>>>>>> the wrong caching mode.
>>>>>>
>>>>>> update/setup/... don't make that quite clear. Any other suggestions?
>>>>>
>>>>> I'm very poor on naming.. :( So far anything seems slightly better than
>>>>> sanitize to me, as the word "sanitize" is actually also used in memtype.c
>>>>> for other purpose.. see sanitize_phys().
>>>>
>>>> Sure, one can sanitize a lot of things. Here it's the cachemode/pgprot, in
>>>> the other functions it's an address.
>>>>
>>>> Likely we should just call it pfnmap_X_cachemode().
>>>>
>>>> Set/update don't really fit for X in case pfnmap_X_cachemode() is a NOP.
>>>>
>>>> pfnmap_setup_cachemode() ? Hm.
>>>
>>> Sounds good here.
>>
>> Okay, I'll use that one. If ever something else besides PAT would require
>> different semantics, they can bother with finding a better name :)
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> + * @pfn: the start of the pfn range
>>>>>>>> + * @size: the size of the pfn range
>>>>>>>> + * @prot: the pgprot to sanitize
>>>>>>>> + *
>>>>>>>> + * Sanitize the given pgprot for a pfn range, for example, adjusting the
>>>>>>>> + * cachemode.
>>>>>>>> + *
>>>>>>>> + * This function cannot fail for a single page, but can fail for multiple
>>>>>>>> + * pages.
>>>>>>>> + *
>>>>>>>> + * Returns 0 on success and -EINVAL on error.
>>>>>>>> + */
>>>>>>>> +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
>>>>>>>> +		pgprot_t *prot);
>>>>>>>>      extern int track_pfn_copy(struct vm_area_struct *dst_vma,
>>>>>>>>      		struct vm_area_struct *src_vma, unsigned long *pfn);
>>>>>>>>      extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
>>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>>> index fdcf0a6049b9f..b8ae5e1493315 100644
>>>>>>>> --- a/mm/huge_memory.c
>>>>>>>> +++ b/mm/huge_memory.c
>>>>>>>> @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>>>>>>>>      			return VM_FAULT_OOM;
>>>>>>>>      	}
>>>>>>>> -	track_pfn_insert(vma, &pgprot, pfn);
>>>>>>>> +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
>>>>>>>> +		return VM_FAULT_FALLBACK;
>>>>>>>
>>>>>>> Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
>>>>>>> trigger, though.
>>>>>>>
>>>>>>> Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
>>>>>>> replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
>>>>>>> a win already in all cases.
>>>>>>
>>>>>> It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. That's
>>>>>> certainly helpful for the single-page case.
>>>>>>
>>>>>> Regarding never failing here: we should check the whole range. We have to
>>>>>> make sure that none of the pages has a memory type / caching mode that is
>>>>>> incompatible with what we setup.
>>>>>
>>>>> Would it happen in real world?
>>>>>
>>>>> IIUC per-vma registration needs to happen first, which checks for memtype
>>>>> conflicts in the first place, or reserve_pfn_range() could already have
>>>>> failed.
>>>>>
>>>>> Here it's the fault path looking up the memtype, so I would expect it is
>>>>> guaranteed all pfns under the same vma are following the verified (and same)
>>>>> memtype?
>>>>
>>>> The whole point of track_pfn_insert() is that it is used when we *don't* use
>>>> reserve_pfn_range()->track_pfn_remap(), no?
>>>>
>>>> track_pfn_remap() would check the whole range that gets mapped, so
>>>> track_pfn_insert() user must similarly check the whole range that gets
>>>> mapped.
>>>>
>>>> Note that even track_pfn_insert() is already pretty clear on the intended
>>>> usage: "called when a _new_ single pfn is established"
>>>
>>> We need to define "new" then..  But I agree it's not crystal clear at
>>> least.  I think I just wasn't the first to assume it was reserved, see this
>>> (especially, the "Expectation" part..):
>>>
>>> commit 5180da410db6369d1f95c9014da1c9bc33fb043e
>>> Author: Suresh Siddha <suresh.b.siddha@intel.com>
>>> Date:   Mon Oct 8 16:28:29 2012 -0700
>>>
>>>       x86, pat: separate the pfn attribute tracking for remap_pfn_range and vm_insert_pfn
>>>
>>>       With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
>>>       attribute and uses it.  Expectation is that the driver reserves the
>>>       memory attributes for the pfn before calling vm_insert_pfn().
>>
>> It's all confusing.
>>
>> We do have the following functions relevant in pat code:
>>
>> (1) memtype_reserve(): used by ioremap and set_memory_XX
>>
>> (2) memtype_reserve_io(): used by iomap
>>
>> (3) reserve_pfn_range(): only remap_pfn_range() calls it
>>
>> (4) arch_io_reserve_memtype_wc()
>>
>>
>> Which one would perform the reservation for, say, vfio?
> 
> My understanding is it was done via barmap.  See this stack:
> 
> vfio_pci_core_mmap
>    pci_iomap
>      pci_iomap_range
>        ...
>          __ioremap_caller
>            memtype_reserve
> 
>>
>>
>> I agree that if there were a guarantee/expectation that all PFNs have
>> the same memtype (from a previous reservation), it would be sufficient to
>> check a single PFN, and we could document that. I just don't easily see
>> where that reservation is happening.
>>
>> So a pointer to that would be appreciated!
> 
> I am not aware of any pointer.. maybe others could chime in.
> 
> IMHO, if there's anything uncertain, for this one we could always decouple
> this issue from the core issue you're working on, so at least it keeps the
> old behavior (which is pure lookup on pfn injections) until a solid issue
> occurs?  It avoids the case where we could introduce unnecessary code but
> then it's much harder to justify a removal.  What do you think?

I'll use the _pfn variant and document the behavior.

I do wonder why we even have to look up the memtype again if the caller 
apparently reserved it (which implied checking it). All a bit weird.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
  2025-04-29 16:25                 ` David Hildenbrand
@ 2025-04-29 16:36                   ` Peter Xu
  0 siblings, 0 replies; 59+ messages in thread
From: Peter Xu @ 2025-04-29 16:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato

On Tue, Apr 29, 2025 at 06:25:06PM +0200, David Hildenbrand wrote:
> I do wonder why we even have to look up the memtype again if the caller
> apparently reserved it (which implied checking it). All a bit weird.

Maybe it's because the memtype info isn't always visible to the upper
layers, e.g. the default pci_iomap() for MMIOs doesn't need to specify
anything about the cache mode.  There's the pci_iomap_wc() variant, but it
still looks like only the internals have full control..
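
To illustrate that: at this layer a driver picks the cache mode only
implicitly, by choosing which helper it calls, while the memtype
reservation happens internally. A hedged example, not code from the
series:

static int demo_map_bars(struct pci_dev *pdev)
{
	void __iomem *regs, *fb;

	/*
	 * Default mapping is uncached; the memtype reservation happens
	 * internally (pci_iomap() -> ioremap() -> memtype_reserve()).
	 */
	regs = pci_iomap(pdev, 0, 0);
	if (!regs)
		return -ENOMEM;

	/*
	 * Write-combining variant: the only cache-mode knob exposed to
	 * the driver at this level.
	 */
	fb = pci_iomap_wc(pdev, 1, 0);
	if (!fb) {
		pci_iounmap(pdev, regs);
		return -ENOMEM;
	}

	pci_iounmap(pdev, fb);
	pci_iounmap(pdev, regs);
	return 0;
}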

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 09/11] x86/mm/pat: remove MEMTYPE_*_MATCH
  2025-04-28 20:23   ` Lorenzo Stoakes
@ 2025-05-05 12:10     ` David Hildenbrand
  2025-05-06  9:30       ` Lorenzo Stoakes
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-05-05 12:10 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu

On 28.04.25 22:23, Lorenzo Stoakes wrote:
> On Fri, Apr 25, 2025 at 10:17:13AM +0200, David Hildenbrand wrote:
>> The "memramp() shrinking" scenario no longer applies, so let's remove
>> that now-unnecessary handling.
> 
> I wonder if we could remove even more of the code given the
> simplifications here? But not a big deal.

It might make sense to inline memtype_match().

diff --git a/arch/x86/mm/pat/memtype_interval.c b/arch/x86/mm/pat/memtype_interval.c
index 9d03f0dbc4715..e5844ed1311ed 100644
--- a/arch/x86/mm/pat/memtype_interval.c
+++ b/arch/x86/mm/pat/memtype_interval.c
@@ -49,21 +49,6 @@ INTERVAL_TREE_DEFINE(struct memtype, rb, u64, subtree_max_end,
  
  static struct rb_root_cached memtype_rbroot = RB_ROOT_CACHED;
  
-static struct memtype *memtype_match(u64 start, u64 end)
-{
-       struct memtype *entry_match;
-
-       entry_match = interval_iter_first(&memtype_rbroot, start, end-1);
-
-       while (entry_match != NULL && entry_match->start < end) {
-               if (entry_match->start == start && entry_match->end == end)
-                       return entry_match;
-               entry_match = interval_iter_next(entry_match, start, end-1);
-       }
-
-       return NULL; /* Returns NULL if there is no match */
-}
-
  static int memtype_check_conflict(u64 start, u64 end,
                                   enum page_cache_mode reqtype,
                                   enum page_cache_mode *newtype)
@@ -119,14 +104,16 @@ int memtype_check_insert(struct memtype *entry_new, enum page_cache_mode *ret_ty
  
  struct memtype *memtype_erase(u64 start, u64 end)
  {
-       struct memtype *entry_old;
-
-       entry_old = memtype_match(start, end);
-       if (!entry_old)
-               return ERR_PTR(-EINVAL);
-
-       interval_remove(entry_old, &memtype_rbroot);
-       return entry_old;
+       struct memtype *entry = interval_iter_first(&memtype_rbroot, start, end - 1);
+
+       while (entry && entry->start < end) {
+               if (entry->start == start && entry->end == end) {
+                       interval_remove(entry, &memtype_rbroot);
+                       return entry;
+               }
+               entry = interval_iter_next(entry, start, end - 1);
+       }
+       return ERR_PTR(-EINVAL);
  }
  
  struct memtype *memtype_lookup(u64 addr)
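
For context, the caller side stays as simple as before; roughly what
memtype_free() does with it (from memory, details elided):

	spin_lock(&memtype_lock);
	entry = memtype_erase(start, end);
	spin_unlock(&memtype_lock);

	if (IS_ERR(entry))
		return -EINVAL;	/* plus the usual pr_info() about an invalid free */

	kfree(entry);
	return 0;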


Thanks for all the review!

-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-04-28 20:10   ` Lorenzo Stoakes
@ 2025-05-05 13:00     ` David Hildenbrand
  2025-05-07 13:25       ` David Hildenbrand
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-05-05 13:00 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu,
	Suren Baghdasaryan

>>
>> This change implies that we'll keep tracking the original PFN range even
>> after splitting + partially unmapping it: not too bad, because it was
>> not working reliably before. The only thing that kind-of worked before
>> was shrinking such a mapping using mremap(): we managed to adjust the
>> reservation in a hacky way, now we won't adjust the reservation but
>> leave it around until all involved VMAs are gone.
> 
> Hm, but what if we shrink a VMA, then map another one, might it be
> incorrectly storing PAT attributes for part of the range that is now mapped
> elsewhere?

Not "incorrectly". We'll simply undo the reservation of the cachemode 
for the original PFN range once everything of the original VMA is gone.

AFAIK, after shrinking, one can usually mmap() the "unmapped" part again 
with the same cachemode, which should be the main use case.

Supporting partial un-tracking will require hooking into vma splitting 
code ... not something I am super happy about. :)
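
For reference, the teardown implied by the kref scheme is roughly the
following; the struct and function names are from the patch, the body is
a sketch (assuming pfnmap_untrack(pfn, size) as the untrack primitive):

void pfnmap_track_ctx_release(struct kref *ref)
{
	struct pfnmap_track_ctx *ctx = container_of(ref,
			struct pfnmap_track_ctx, kref);

	/* Undo the reservation for the whole originally-tracked range. */
	pfnmap_untrack(ctx->pfn, ctx->size);
	kfree(ctx);
}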

> 
> Also my god re: the 'kind of working' aspects of PAT, so frustrating.
> 
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Generally looking good, afaict, but maybe let's get some input from Suren
> on VMA size.
> 
> Are there actually any PAT tests out here? I had a quick glance in
> tools/testing/selftests/x86,mm and couldn't find any, but didn't look
> _that_ hard.

Heh, booting a simple VM gets PAT involved. I suspect because of /dev/mem 
and BIOS/GPU/whatever hacks.

In the cover letter I have

"Briefly tested with some basic /dev/mem test I crafted. I want to 
convert them to selftests, but that might or might not require a bit of
more work (e.g., /dev/mem accessibility)."

> 
> Thanks in general for tackling this, this is a big improvement!
> 
>> ---
>>   include/linux/mm_inline.h |  2 +
>>   include/linux/mm_types.h  | 11 ++++++
>>   kernel/fork.c             | 54 ++++++++++++++++++++++++--
>>   mm/memory.c               | 81 +++++++++++++++++++++++++++++++--------
>>   mm/mremap.c               |  4 --
>>   5 files changed, 128 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> index f9157a0c42a5c..89b518ff097e6 100644
>> --- a/include/linux/mm_inline.h
>> +++ b/include/linux/mm_inline.h
>> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>>
>>   #endif  /* CONFIG_ANON_VMA_NAME */
>>
>> +void pfnmap_track_ctx_release(struct kref *ref);
>> +
>>   static inline void init_tlb_flush_pending(struct mm_struct *mm)
>>   {
>>   	atomic_set(&mm->tlb_flush_pending, 0);
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 56d07edd01f91..91124761cfda8 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -764,6 +764,14 @@ struct vma_numab_state {
>>   	int prev_scan_seq;
>>   };
>>
>> +#ifdef __HAVE_PFNMAP_TRACKING
>> +struct pfnmap_track_ctx {
>> +	struct kref kref;
>> +	unsigned long pfn;
>> +	unsigned long size;
> 
> Again, (super) nitty, but we really should express units. I suppose 'size'
> implies bytes to be honest as you'd unlikely say 'size' for number of pages
> (you'd go with nr_pages or something). But maybe a trailing /* in bytes */
> would help.
> 
> Not a big deal though!

"size" in the kernel is usually bytes, never pages ... but I might be wrong.

Anyhow, I can use "/* in bytes */" here, although I doubt that many will 
benefit from this comment :)

> 
>> +};
>> +#endif
>> +
>>   /*
>>    * This struct describes a virtual memory area. There is one of these
>>    * per VM-area/task. A VM area is any part of the process virtual memory
>> @@ -877,6 +885,9 @@ struct vm_area_struct {
>>   	struct anon_vma_name *anon_name;
>>   #endif
>>   	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>> +#ifdef __HAVE_PFNMAP_TRACKING
> 
> An aside, but absolutely hate '__HAVE_PFNMAP_TRACKING' as a name here. But
> you didn't create it, and it's not really sensible to change it in this
> series so. Just a grumble...

I cannot argue with that ... same here.

To be clear: I hate all of this with a passion ;) With this series, I hate 
it a bit less.

[...]

> 
> Obviously my series will break this but should be _fairly_ trivial to
> update.
> 
> You will however have to make sure to update tools/testing/vma/* to handle
> the new functions in userland testing (they need to be stubbed out).

Ah, I was happy it compiled but looks like I'll have to mess with that 
as well.

> 
> If it makes life easier, you can even send it to me off-list, or just send
> it without changing this in a respin and I can fix it up fairly quick for
> you.

Let me give it a try first, I'll let you know if it takes me too long.

Thanks!

[...]

>>   /**
>>    * remap_pfn_range - remap kernel memory to userspace
>>    * @vma: user vma to map to
>> @@ -2883,20 +2902,50 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
>>    *
>>    * Return: %0 on success, negative error code otherwise.
>>    */
>> +#ifdef __HAVE_PFNMAP_TRACKING
>>   int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>>   		    unsigned long pfn, unsigned long size, pgprot_t prot)
> 
> OK so to expose some of my lack-of-knowledge of PAT - is this the
> 'entrypoint' to PAT tracking?

Only if you're using remap_pfn_range() ... there is other low-level 
tracking/reservation using the memtype_reserve() interface and friends.

> 
> So we have some kernel memory we remap to userland as PFN map, the kind
> that very well might be sensible to use PAT to change cache behaviour for,
> and each time this happens, it's mapped as PAT?

Right, anytime someone uses remap_pfn_range() on the full VMA, we track 
it (depending on RAM vs. !RAM this "tracking" has different semantics).

For RAM, we seem to only look up the cachemode. For !RAM, we seem to 
reserve the memtype for the PFN range, which will fail if there already 
is an incompatible memtype reserved.

It's all ... very weird.
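
Condensed into code, that split looks roughly like this; a simplified
sketch of the PAT side (pgprot_set_cachemode() is the helper factored
out in patch #1, the rest loosely follows memtype.c, ignoring the mixed
RAM/!RAM case):

static int pfnmap_track_sketch(unsigned long pfn, unsigned long size,
			       pgprot_t *prot)
{
	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
	enum page_cache_mode pcm = pgprot2cachemode(*prot);

	if (pat_pagerange_is_ram(paddr, paddr + size)) {
		/* RAM: only look up the cachemode, nothing is reserved. */
		pcm = lookup_memtype(paddr);
	} else {
		/*
		 * !RAM: reserve the memtype; fails if an incompatible
		 * memtype was already reserved for (part of) the range.
		 */
		if (memtype_reserve(paddr, paddr + size, pcm, &pcm))
			return -EINVAL;
	}
	*prot = pgprot_set_cachemode(*prot, pcm);
	return 0;
}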

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 09/11] x86/mm/pat: remove MEMTYPE_*_MATCH
  2025-05-05 12:10     ` David Hildenbrand
@ 2025-05-06  9:30       ` Lorenzo Stoakes
  0 siblings, 0 replies; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-05-06  9:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu

On Mon, May 05, 2025 at 02:10:53PM +0200, David Hildenbrand wrote:
> On 28.04.25 22:23, Lorenzo Stoakes wrote:
> > On Fri, Apr 25, 2025 at 10:17:13AM +0200, David Hildenbrand wrote:
> > > The "memramp() shrinking" scenario no longer applies, so let's remove
> > > that now-unnecessary handling.
> >
> > I wonder if we could remove even more of the code given the
> > simplifications here? But not a big deal.
>
> It might make sense to inline memtype_match().
>
> diff --git a/arch/x86/mm/pat/memtype_interval.c b/arch/x86/mm/pat/memtype_interval.c
> index 9d03f0dbc4715..e5844ed1311ed 100644
> --- a/arch/x86/mm/pat/memtype_interval.c
> +++ b/arch/x86/mm/pat/memtype_interval.c
> @@ -49,21 +49,6 @@ INTERVAL_TREE_DEFINE(struct memtype, rb, u64, subtree_max_end,
>  static struct rb_root_cached memtype_rbroot = RB_ROOT_CACHED;
> -static struct memtype *memtype_match(u64 start, u64 end)
> -{
> -       struct memtype *entry_match;
> -
> -       entry_match = interval_iter_first(&memtype_rbroot, start, end-1);
> -
> -       while (entry_match != NULL && entry_match->start < end) {
> -               if (entry_match->start == start && entry_match->end == end)
> -                       return entry_match;
> -               entry_match = interval_iter_next(entry_match, start, end-1);
> -       }
> -
> -       return NULL; /* Returns NULL if there is no match */
> -}
> -
>  static int memtype_check_conflict(u64 start, u64 end,
>                                   enum page_cache_mode reqtype,
>                                   enum page_cache_mode *newtype)
> @@ -119,14 +104,16 @@ int memtype_check_insert(struct memtype *entry_new, enum page_cache_mode *ret_ty
>  struct memtype *memtype_erase(u64 start, u64 end)
>  {
> -       struct memtype *entry_old;
> -
> -       entry_old = memtype_match(start, end);
> -       if (!entry_old)
> -               return ERR_PTR(-EINVAL);
> -
> -       interval_remove(entry_old, &memtype_rbroot);
> -       return entry_old;
> +       struct memtype *entry = interval_iter_first(&memtype_rbroot, start, end - 1);
> +
> +       while (entry && entry->start < end) {
> +               if (entry->start == start && entry->end == end) {
> +                       interval_remove(entry, &memtype_rbroot);
> +                       return entry;
> +               }
> +               entry = interval_iter_next(entry, start, end - 1);
> +       }
> +       return ERR_PTR(-EINVAL);
>  }
>  struct memtype *memtype_lookup(u64 addr)
>
>
> Thanks for all the review!

You're welcome :)

I _think_ I'm all caught up on my side of this review, ping me if there's
anything more you need from me.

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-05-05 13:00     ` David Hildenbrand
@ 2025-05-07 13:25       ` David Hildenbrand
  2025-05-07 14:27         ` Lorenzo Stoakes
  0 siblings, 1 reply; 59+ messages in thread
From: David Hildenbrand @ 2025-05-07 13:25 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu,
	Suren Baghdasaryan

>>
>> Obviously my series will break this but should be _fairly_ trivial to
>> update.
>>
>> You will however have to make sure to update tools/testing/vma/* to handle
> > the new functions in userland testing (they need to be stubbed out).

Hmm, seems to compile. I guess because we won't have
"__HAVE_PFNMAP_TRACKING" defined in the test environment, so
the existing stubs in there already seem to do the trick.


+#else
+static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
+		struct vm_area_struct *new)
+{
+	return 0;
+}
+static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
+{
+}
+#endif


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v1 05/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
  2025-05-07 13:25       ` David Hildenbrand
@ 2025-05-07 14:27         ` Lorenzo Stoakes
  0 siblings, 0 replies; 59+ messages in thread
From: Lorenzo Stoakes @ 2025-05-07 14:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
	linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu,
	Suren Baghdasaryan

On Wed, May 07, 2025 at 03:25:42PM +0200, David Hildenbrand wrote:
> > >
> > > Obviously my series will break this but should be _fairly_ trivial to
> > > update.
> > >
> > > You will however have to make sure to update tools/testing/vma/* to handle
> > > the new functions in userland testing (they need to be stubbed out).
>
> Hmm, seems to compile. I guess because we won't have
> "__HAVE_PFNMAP_TRACKING" defined in the test environment, so
> the existing stubs in there already seem to do the trick.
>
>
> +#else
> +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> +		struct vm_area_struct *new)
> +{
> +	return 0;
> +}
> +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
> +{
> +}
> +#endif
>

OK cool! Then we're good :>)

>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2025-05-07 14:34 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-25  8:17 [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
2025-04-25  8:17 ` [PATCH v1 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode() David Hildenbrand
2025-04-28 16:16   ` Lorenzo Stoakes
2025-04-28 16:19     ` David Hildenbrand
2025-04-25  8:17 ` [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot() David Hildenbrand
2025-04-25 19:31   ` Peter Xu
2025-04-25 19:48     ` David Hildenbrand
2025-04-25 23:59       ` Peter Xu
2025-04-28 14:58         ` David Hildenbrand
2025-04-28 16:21           ` Peter Xu
2025-04-28 20:37             ` David Hildenbrand
2025-04-29 13:44               ` Peter Xu
2025-04-29 16:25                 ` David Hildenbrand
2025-04-29 16:36                   ` Peter Xu
2025-04-25 19:56     ` David Hildenbrand
2025-04-25  8:17 ` [PATCH v1 03/11] x86/mm/pat: introduce pfnmap_track() and pfnmap_untrack() David Hildenbrand
2025-04-28 16:53   ` Lorenzo Stoakes
2025-04-28 17:12     ` David Hildenbrand
2025-04-28 18:58       ` Lorenzo Stoakes
2025-04-25  8:17 ` [PATCH v1 04/11] mm/memremap: convert to pfnmap_track() + pfnmap_untrack() David Hildenbrand
2025-04-25 20:00   ` Peter Xu
2025-04-25 20:14     ` David Hildenbrand
2025-04-28 16:54     ` Lorenzo Stoakes
2025-04-28 17:07     ` Lorenzo Stoakes
2025-04-25  8:17 ` [PATCH v1 05/11] mm: convert VM_PFNMAP tracking " David Hildenbrand
2025-04-25 20:23   ` Peter Xu
2025-04-25 20:36     ` David Hildenbrand
2025-04-28 16:08       ` Peter Xu
2025-04-28 16:16         ` David Hildenbrand
2025-04-28 16:24           ` Peter Xu
2025-04-28 17:23             ` David Hildenbrand
2025-04-28 19:37               ` Lorenzo Stoakes
2025-04-28 19:57                 ` Suren Baghdasaryan
2025-04-28 20:23                   ` David Hildenbrand
2025-04-28 20:19                 ` David Hildenbrand
2025-04-28 19:38   ` Lorenzo Stoakes
2025-04-28 20:00     ` Suren Baghdasaryan
2025-04-28 20:21       ` David Hildenbrand
2025-04-28 20:10   ` Lorenzo Stoakes
2025-05-05 13:00     ` David Hildenbrand
2025-05-07 13:25       ` David Hildenbrand
2025-05-07 14:27         ` Lorenzo Stoakes
2025-04-25  8:17 ` [PATCH v1 06/11] x86/mm/pat: remove old pfnmap tracking interface David Hildenbrand
2025-04-28 20:12   ` Lorenzo Stoakes
2025-04-25  8:17 ` [PATCH v1 07/11] mm: remove VM_PAT David Hildenbrand
2025-04-28 20:16   ` Lorenzo Stoakes
2025-04-25  8:17 ` [PATCH v1 08/11] x86/mm/pat: remove strict_prot parameter from reserve_pfn_range() David Hildenbrand
2025-04-28 20:18   ` Lorenzo Stoakes
2025-04-25  8:17 ` [PATCH v1 09/11] x86/mm/pat: remove MEMTYPE_*_MATCH David Hildenbrand
2025-04-28 20:23   ` Lorenzo Stoakes
2025-05-05 12:10     ` David Hildenbrand
2025-05-06  9:30       ` Lorenzo Stoakes
2025-04-25  8:17 ` [PATCH v1 10/11] drm/i915: track_pfn() -> "pfnmap tracking" David Hildenbrand
2025-04-28 20:23   ` Lorenzo Stoakes
2025-04-25  8:17 ` [PATCH v1 11/11] mm/io-mapping: " David Hildenbrand
2025-04-28 16:06   ` Lorenzo Stoakes
2025-04-28 16:14     ` David Hildenbrand
2025-04-25  8:54 ` [PATCH v1 00/11] mm: rewrite pfnmap tracking and remove VM_PAT Ingo Molnar
2025-04-25  9:27   ` David Hildenbrand
