* [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT
@ 2025-05-12 12:34 David Hildenbrand
2025-05-12 12:34 ` [PATCH v2 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode() David Hildenbrand
` (11 more replies)
0 siblings, 12 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-05-12 12:34 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu
On top of mm-unstable.
VM_PAT annoyed me too much and wasted too much of my time; let's clean up
PAT handling and remove VM_PAT.
This should sort out various issues with VM_PAT we discovered recently,
and will hopefully make the whole code more stable and easier to maintain.
In essence: we stop letting PAT mode mess with VMAs and instead lift the
decision of what to track/untrack into the MM core. We remember, per VMA,
which pfn range we tracked in a new struct attached to the VMA (we have
space without exceeding 192 bytes), use a kref to share it among VMAs during
split/mremap/fork, and automatically untrack once the kref drops to 0.
This implies that we'll keep tracking a full pfn range even after partially
unmapping it, until it is fully unmapped; but as that case was mostly broken
before, this at least makes it work in a way that is least intrusive to
VMA handling.
Shrinking with mremap() used to work in a hacky way; now we'll similarly
keep the original pfn range tracked even after this form of partial unmap.
Does anybody care about that? Unlikely. If we ever run into issues, we could
likely handle that (adjust the tracking) when our kref drops to 1 while
freeing a VMA. But that adds more complexity, so let's avoid it for now.
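To illustrate the approach, here is the core of the new per-VMA tracking
as introduced in patch #4 (a simplified sketch, with error handling
trimmed):

    struct pfnmap_track_ctx {
        struct kref kref;       /* shared across split/mremap/fork */
        unsigned long pfn;      /* start of the tracked range */
        unsigned long size;     /* in bytes */
    };

    /* Called when the last VMA reference goes away: undo the tracking. */
    void pfnmap_track_ctx_release(struct kref *ref)
    {
        struct pfnmap_track_ctx *ctx =
            container_of(ref, struct pfnmap_track_ctx, kref);

        pfnmap_untrack(ctx->pfn, ctx->size);
        kfree(ctx);
    }

Each VM_PFNMAP VMA that performed actual tracking points at such a context;
duplicating a VMA takes a reference, freeing a VMA drops it.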
Briefly tested with the new pfnmap selftests [1].
[1] https://lkml.kernel.org/r/20250509153033.952746-1-david@redhat.com
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Peter Xu <peterx@redhat.com>
v1 -> v2:
* "mm: convert track_pfn_insert() to pfnmap_setup_cachemode*()"
-> Call it "pfnmap_setup_cachemode()" and improve the documentation
-> Add pfnmap_setup_cachemode_pfn()
-> Keep checking a single PFN for PMD/PUD case and document why it's ok
* Merged memremap conversion patch with pfnmap_track() introduction patch
-> Improve documentation
* "mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()"
-> Adjust to code changes in mm-unstable
* Added "x86/mm/pat: inline memtype_match() into memtype_erase()"
* "mm/io-mapping: track_pfn() -> "pfnmap tracking""
-> Adjust to code changes in mm-unstable
David Hildenbrand (11):
x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode()
mm: convert track_pfn_insert() to pfnmap_setup_cachemode*()
mm: introduce pfnmap_track() and pfnmap_untrack() and use them for
memremap
mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
x86/mm/pat: remove old pfnmap tracking interface
mm: remove VM_PAT
x86/mm/pat: remove strict_prot parameter from reserve_pfn_range()
x86/mm/pat: remove MEMTYPE_*_MATCH
x86/mm/pat: inline memtype_match() into memtype_erase()
drm/i915: track_pfn() -> "pfnmap tracking"
mm/io-mapping: track_pfn() -> "pfnmap tracking"
arch/x86/mm/pat/memtype.c | 194 ++++-------------------------
arch/x86/mm/pat/memtype_interval.c | 63 ++--------
drivers/gpu/drm/i915/i915_mm.c | 4 +-
include/linux/mm.h | 4 +-
include/linux/mm_inline.h | 2 +
include/linux/mm_types.h | 11 ++
include/linux/pgtable.h | 127 ++++++++++---------
include/trace/events/mmflags.h | 4 +-
mm/huge_memory.c | 5 +-
mm/io-mapping.c | 2 +-
mm/memory.c | 86 ++++++++++---
mm/memremap.c | 8 +-
mm/mmap.c | 5 -
mm/mremap.c | 4 -
mm/vma_init.c | 50 ++++++++
15 files changed, 242 insertions(+), 327 deletions(-)
base-commit: c68cfbc5048ede4b10a1d3fe16f7f6192fc2c9c8
--
2.49.0
* [PATCH v2 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode()
2025-05-12 12:34 [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
@ 2025-05-12 12:34 ` David Hildenbrand
2025-05-13 17:29 ` Liam R. Howlett
2025-05-12 12:34 ` [PATCH v2 02/11] mm: convert track_pfn_insert() to pfnmap_setup_cachemode*() David Hildenbrand
` (10 subsequent siblings)
11 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-05-12 12:34 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu, Ingo Molnar
Let's factor it out to make the code easier to grasp. Drop one comment
where it is now rather obvious what is happening.
Use it also in pgprot_writecombine()/pgprot_writethrough(), where clearing
the old cachemode might not be required; but given that we are already
doing a function call, there is no need to care about this
micro-optimization.
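The transformation is mechanical; for example (taken from the hunks below):

    /* before */
    *prot = __pgprot((pgprot_val(*prot) & ~_PAGE_CACHE_MASK) |
                     cachemode2protval(pcm));

    /* after */
    pgprot_set_cachemode(prot, pcm);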
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
Signed-off-by: David Hildenbrand <david@redhat.com>
---
arch/x86/mm/pat/memtype.c | 33 +++++++++++++++------------------
1 file changed, 15 insertions(+), 18 deletions(-)
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 72d8cbc611583..edec5859651d6 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -800,6 +800,12 @@ static inline int range_is_allowed(unsigned long pfn, unsigned long size)
}
#endif /* CONFIG_STRICT_DEVMEM */
+static inline void pgprot_set_cachemode(pgprot_t *prot, enum page_cache_mode pcm)
+{
+ *prot = __pgprot((pgprot_val(*prot) & ~_PAGE_CACHE_MASK) |
+ cachemode2protval(pcm));
+}
+
int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
unsigned long size, pgprot_t *vma_prot)
{
@@ -811,8 +817,7 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
if (file->f_flags & O_DSYNC)
pcm = _PAGE_CACHE_MODE_UC_MINUS;
- *vma_prot = __pgprot((pgprot_val(*vma_prot) & ~_PAGE_CACHE_MASK) |
- cachemode2protval(pcm));
+ pgprot_set_cachemode(vma_prot, pcm);
return 1;
}
@@ -880,9 +885,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
(unsigned long long)paddr,
(unsigned long long)(paddr + size - 1),
cattr_name(pcm));
- *vma_prot = __pgprot((pgprot_val(*vma_prot) &
- (~_PAGE_CACHE_MASK)) |
- cachemode2protval(pcm));
+ pgprot_set_cachemode(vma_prot, pcm);
}
return 0;
}
@@ -907,9 +910,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
* We allow returning different type than the one requested in
* non strict case.
*/
- *vma_prot = __pgprot((pgprot_val(*vma_prot) &
- (~_PAGE_CACHE_MASK)) |
- cachemode2protval(pcm));
+ pgprot_set_cachemode(vma_prot, pcm);
}
if (memtype_kernel_map_sync(paddr, size, pcm) < 0) {
@@ -1060,9 +1061,7 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
return -EINVAL;
}
- *prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
- cachemode2protval(pcm));
-
+ pgprot_set_cachemode(prot, pcm);
return 0;
}
@@ -1073,10 +1072,8 @@ void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
if (!pat_enabled())
return;
- /* Set prot based on lookup */
pcm = lookup_memtype(pfn_t_to_phys(pfn));
- *prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
- cachemode2protval(pcm));
+ pgprot_set_cachemode(prot, pcm);
}
/*
@@ -1115,15 +1112,15 @@ void untrack_pfn_clear(struct vm_area_struct *vma)
pgprot_t pgprot_writecombine(pgprot_t prot)
{
- return __pgprot(pgprot_val(prot) |
- cachemode2protval(_PAGE_CACHE_MODE_WC));
+ pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WC);
+ return prot;
}
EXPORT_SYMBOL_GPL(pgprot_writecombine);
pgprot_t pgprot_writethrough(pgprot_t prot)
{
- return __pgprot(pgprot_val(prot) |
- cachemode2protval(_PAGE_CACHE_MODE_WT));
+ pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WT);
+ return prot;
}
EXPORT_SYMBOL_GPL(pgprot_writethrough);
--
2.49.0
* [PATCH v2 02/11] mm: convert track_pfn_insert() to pfnmap_setup_cachemode*()
2025-05-12 12:34 [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
2025-05-12 12:34 ` [PATCH v2 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode() David Hildenbrand
@ 2025-05-12 12:34 ` David Hildenbrand
2025-05-12 15:43 ` Lorenzo Stoakes
2025-05-13 17:29 ` Liam R. Howlett
2025-05-12 12:34 ` [PATCH v2 03/11] mm: introduce pfnmap_track() and pfnmap_untrack() and use them for memremap David Hildenbrand
` (9 subsequent siblings)
11 siblings, 2 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-05-12 12:34 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu, Ingo Molnar
... by factoring it out from track_pfn_remap() into
pfnmap_setup_cachemode() and providing pfnmap_setup_cachemode_pfn() as
a replacement for track_pfn_insert().
For PMDs/PUDs, we keep checking only a single pfn. Add some documentation,
and also document why it is valid not to check the whole pfn range.
We'll reuse pfnmap_setup_cachemode() from core MM next.
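For instance, callers of the vmf_insert_pfn*() machinery now end up with
(see the mm/memory.c hunk below):

    /* before: needed a VMA, took a pfn_t */
    track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));

    /* after: no VMA required, just the pfn */
    pfnmap_setup_cachemode_pfn(pfn, &pgprot);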
Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
Signed-off-by: David Hildenbrand <david@redhat.com>
---
arch/x86/mm/pat/memtype.c | 24 ++++++------------
include/linux/pgtable.h | 52 +++++++++++++++++++++++++++++++++------
mm/huge_memory.c | 5 ++--
mm/memory.c | 4 +--
4 files changed, 57 insertions(+), 28 deletions(-)
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index edec5859651d6..fa78facc6f633 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -1031,7 +1031,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
unsigned long pfn, unsigned long addr, unsigned long size)
{
resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
- enum page_cache_mode pcm;
/* reserve the whole chunk starting from paddr */
if (!vma || (addr == vma->vm_start
@@ -1044,13 +1043,17 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
return ret;
}
+ return pfnmap_setup_cachemode(pfn, size, prot);
+}
+
+int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, pgprot_t *prot)
+{
+ resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
+ enum page_cache_mode pcm;
+
if (!pat_enabled())
return 0;
- /*
- * For anything smaller than the vma size we set prot based on the
- * lookup.
- */
pcm = lookup_memtype(paddr);
/* Check memtype for the remaining pages */
@@ -1065,17 +1068,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
return 0;
}
-void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
-{
- enum page_cache_mode pcm;
-
- if (!pat_enabled())
- return;
-
- pcm = lookup_memtype(pfn_t_to_phys(pfn));
- pgprot_set_cachemode(prot, pcm);
-}
-
/*
* untrack_pfn is called while unmapping a pfnmap for a region.
* untrack can be called for a specific region indicated by pfn and size or
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f1e890b604609..be1745839871c 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1496,13 +1496,10 @@ static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
return 0;
}
-/*
- * track_pfn_insert is called when a _new_ single pfn is established
- * by vmf_insert_pfn().
- */
-static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
- pfn_t pfn)
+static inline int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size,
+ pgprot_t *prot)
{
+ return 0;
}
/*
@@ -1552,8 +1549,32 @@ static inline void untrack_pfn_clear(struct vm_area_struct *vma)
extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
unsigned long pfn, unsigned long addr,
unsigned long size);
-extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
- pfn_t pfn);
+
+/**
+ * pfnmap_setup_cachemode - setup the cachemode in the pgprot for a pfn range
+ * @pfn: the start of the pfn range
+ * @size: the size of the pfn range in bytes
+ * @prot: the pgprot to modify
+ *
+ * Lookup the cachemode for the pfn range starting at @pfn with the size
+ * @size and store it in @prot, leaving other data in @prot unchanged.
+ *
+ * This allows for a hardware implementation to have fine-grained control of
+ * memory cache behavior at page level granularity. Without a hardware
+ * implementation, this function does nothing.
+ *
+ * Currently there is only one implementation for this - x86 Page Attribute
+ * Table (PAT). See Documentation/arch/x86/pat.rst for more details.
+ *
+ * This function can fail if the pfn range spans pfns that require differing
+ * cachemodes. If the pfn range was previously verified to have a single
+ * cachemode, it is sufficient to query only a single pfn. The assumption is
+ * that this is the case for drivers using the vmf_insert_pfn*() interface.
+ *
+ * Returns 0 on success and -EINVAL on error.
+ */
+int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size,
+ pgprot_t *prot);
extern int track_pfn_copy(struct vm_area_struct *dst_vma,
struct vm_area_struct *src_vma, unsigned long *pfn);
extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
@@ -1563,6 +1584,21 @@ extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
extern void untrack_pfn_clear(struct vm_area_struct *vma);
#endif
+/**
+ * pfnmap_setup_cachemode_pfn - setup the cachemode in the pgprot for a pfn
+ * @pfn: the pfn
+ * @prot: the pgprot to modify
+ *
+ * Lookup the cachemode for @pfn and store it in @prot, leaving other
+ * data in @prot unchanged.
+ *
+ * See pfnmap_setup_cachemode() for details.
+ */
+static inline void pfnmap_setup_cachemode_pfn(unsigned long pfn, pgprot_t *prot)
+{
+ pfnmap_setup_cachemode(pfn, PAGE_SIZE, prot);
+}
+
#ifdef CONFIG_MMU
#ifdef __HAVE_COLOR_ZERO_PAGE
static inline int is_zero_pfn(unsigned long pfn)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2780a12b25f01..d3e66136e41a3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1455,7 +1455,8 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
return VM_FAULT_OOM;
}
- track_pfn_insert(vma, &pgprot, pfn);
+ pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
+
ptl = pmd_lock(vma->vm_mm, vmf->pmd);
error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write,
pgtable);
@@ -1577,7 +1578,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
if (addr < vma->vm_start || addr >= vma->vm_end)
return VM_FAULT_SIGBUS;
- track_pfn_insert(vma, &pgprot, pfn);
+ pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
ptl = pud_lock(vma->vm_mm, vmf->pud);
insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
diff --git a/mm/memory.c b/mm/memory.c
index 99af83434e7c5..064fc55d8eab9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2564,7 +2564,7 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
if (!pfn_modify_allowed(pfn, pgprot))
return VM_FAULT_SIGBUS;
- track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
+ pfnmap_setup_cachemode_pfn(pfn, &pgprot);
return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
false);
@@ -2627,7 +2627,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
if (addr < vma->vm_start || addr >= vma->vm_end)
return VM_FAULT_SIGBUS;
- track_pfn_insert(vma, &pgprot, pfn);
+ pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
return VM_FAULT_SIGBUS;
--
2.49.0
* [PATCH v2 03/11] mm: introduce pfnmap_track() and pfnmap_untrack() and use them for memremap
2025-05-12 12:34 [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
2025-05-12 12:34 ` [PATCH v2 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode() David Hildenbrand
2025-05-12 12:34 ` [PATCH v2 02/11] mm: convert track_pfn_insert() to pfnmap_setup_cachemode*() David Hildenbrand
@ 2025-05-12 12:34 ` David Hildenbrand
2025-05-13 17:40 ` Liam R. Howlett
2025-05-12 12:34 ` [PATCH v2 04/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack() David Hildenbrand
` (8 subsequent siblings)
11 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-05-12 12:34 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu, Ingo Molnar
Let's provide variants of track_pfn_remap() and untrack_pfn() that won't
mess with VMAs, and replace the usage in mm/memremap.c.
Add some documentation.
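The intended pairing is symmetric, as the memremap conversion below shows
(a sketch based on the mm/memremap.c hunks):

    error = pfnmap_track(PHYS_PFN(range->start), range_len(range),
                         &params->pgprot);
    if (error)
        goto err_pfn_remap;
    ...
    /* on teardown (and on the error path): */
    pfnmap_untrack(PHYS_PFN(range->start), range_len(range));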
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
Signed-off-by: David Hildenbrand <david@redhat.com>
---
arch/x86/mm/pat/memtype.c | 14 ++++++++++++++
include/linux/pgtable.h | 39 +++++++++++++++++++++++++++++++++++++++
mm/memremap.c | 8 ++++----
3 files changed, 57 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index fa78facc6f633..1ec8af6cad6bf 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -1068,6 +1068,20 @@ int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, pgprot_t *prot
return 0;
}
+int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot)
+{
+ const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
+
+ return reserve_pfn_range(paddr, size, prot, 0);
+}
+
+void pfnmap_untrack(unsigned long pfn, unsigned long size)
+{
+ const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
+
+ free_pfn_range(paddr, size);
+}
+
/*
* untrack_pfn is called while unmapping a pfnmap for a region.
* untrack can be called for a specific region indicated by pfn and size or
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index be1745839871c..90f72cd358390 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1502,6 +1502,16 @@ static inline int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size,
return 0;
}
+static inline int pfnmap_track(unsigned long pfn, unsigned long size,
+ pgprot_t *prot)
+{
+ return 0;
+}
+
+static inline void pfnmap_untrack(unsigned long pfn, unsigned long size)
+{
+}
+
/*
* track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
* tables copied during copy_page_range(). Will store the pfn to be
@@ -1575,6 +1585,35 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
*/
int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size,
pgprot_t *prot);
+
+/**
+ * pfnmap_track - track a pfn range
+ * @pfn: the start of the pfn range
+ * @size: the size of the pfn range in bytes
+ * @prot: the pgprot to track
+ *
+ * Request the pfn range to be 'tracked' by a hardware implementation and
+ * setup the cachemode in @prot similar to pfnmap_setup_cachemode().
+ *
+ * This allows for fine-grained control of memory cache behaviour at page
+ * level granularity. Tracking memory this way is persisted across VMA splits
+ * (VMA merging does not apply for VM_PFNMAP).
+ *
+ * Currently, there is only one implementation for this - x86 Page Attribute
+ * Table (PAT). See Documentation/arch/x86/pat.rst for more details.
+ *
+ * Returns 0 on success and -EINVAL on error.
+ */
+int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot);
+
+/**
+ * pfnmap_untrack - untrack a pfn range
+ * @pfn: the start of the pfn range
+ * @size: the size of the pfn range in bytes
+ *
+ * Untrack a pfn range previously tracked through pfnmap_track().
+ */
+void pfnmap_untrack(unsigned long pfn, unsigned long size);
extern int track_pfn_copy(struct vm_area_struct *dst_vma,
struct vm_area_struct *src_vma, unsigned long *pfn);
extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
diff --git a/mm/memremap.c b/mm/memremap.c
index 2aebc1b192da9..c417c843e9b1f 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -130,7 +130,7 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
}
mem_hotplug_done();
- untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
+ pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
pgmap_array_delete(range);
}
@@ -211,8 +211,8 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
if (nid < 0)
nid = numa_mem_id();
- error = track_pfn_remap(NULL, &params->pgprot, PHYS_PFN(range->start), 0,
- range_len(range));
+ error = pfnmap_track(PHYS_PFN(range->start), range_len(range),
+ &params->pgprot);
if (error)
goto err_pfn_remap;
@@ -277,7 +277,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
if (!is_private)
kasan_remove_zero_shadow(__va(range->start), range_len(range));
err_kasan:
- untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
+ pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
err_pfn_remap:
pgmap_array_delete(range);
return error;
--
2.49.0
* [PATCH v2 04/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
2025-05-12 12:34 [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
` (2 preceding siblings ...)
2025-05-12 12:34 ` [PATCH v2 03/11] mm: introduce pfnmap_track() and pfnmap_untrack() and use them for memremap David Hildenbrand
@ 2025-05-12 12:34 ` David Hildenbrand
2025-05-12 16:42 ` Lorenzo Stoakes
2025-05-13 17:42 ` Liam R. Howlett
2025-05-12 12:34 ` [PATCH v2 05/11] x86/mm/pat: remove old pfnmap tracking interface David Hildenbrand
` (7 subsequent siblings)
11 siblings, 2 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-05-12 12:34 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu, Ingo Molnar
Let's use our new interface. In remap_pfn_range(), we'll now decide
whether we have to track (full VMA covered) or only lookup the
cachemode (partial VMA covered).
Remember what we have to untrack by linking it from the VMA. When
duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
to anon VMA names, and use a kref to share the tracking.
Once the last VMA un-refs our tracking data, we'll do the untracking,
which simplifies things a lot and should sort out the various issues we
saw recently, for example, when partially unmapping/zapping a tracked VMA.
This change implies that we'll keep tracking the original PFN range even
after splitting + partially unmapping it: not too bad, because it was
not working reliably before. The only thing that kind-of worked before
was shrinking such a mapping using mremap(): we managed to adjust the
reservation in a hacky way; now we won't adjust the reservation but
leave it around until all involved VMAs are gone.
If that ever turns out to be an issue, we could hook into VM splitting
code and split the tracking; however, that adds complexity that might
not be required, so we'll keep it simple for now.
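As an example of the resulting lifecycle (a sketch; the kref transitions
follow from the patch below):

    remap_pfn_range() over the whole VMA
        -> pfnmap_track() succeeds, ctx allocated (kref == 1)
    split_vma()
        -> both VMAs reference the same ctx (kref == 2)
    munmap() of one half
        -> kref drops to 1, the full pfn range stays tracked
    munmap() of the remaining half
        -> kref drops to 0 -> pfnmap_untrack() + kfree(ctx)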
Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/mm_inline.h | 2 +
include/linux/mm_types.h | 11 ++++++
mm/memory.c | 82 +++++++++++++++++++++++++++++++--------
mm/mmap.c | 5 ---
mm/mremap.c | 4 --
mm/vma_init.c | 50 ++++++++++++++++++++++++
6 files changed, 129 insertions(+), 25 deletions(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index f9157a0c42a5c..89b518ff097e6 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
#endif /* CONFIG_ANON_VMA_NAME */
+void pfnmap_track_ctx_release(struct kref *ref);
+
static inline void init_tlb_flush_pending(struct mm_struct *mm)
{
atomic_set(&mm->tlb_flush_pending, 0);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 15808cad2bc1a..3e934dc6057c4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -763,6 +763,14 @@ struct vma_numab_state {
int prev_scan_seq;
};
+#ifdef __HAVE_PFNMAP_TRACKING
+struct pfnmap_track_ctx {
+ struct kref kref;
+ unsigned long pfn;
+ unsigned long size; /* in bytes */
+};
+#endif
+
/*
* Describes a VMA that is about to be mmap()'ed. Drivers may choose to
* manipulate mutable fields which will cause those fields to be updated in the
@@ -900,6 +908,9 @@ struct vm_area_struct {
struct anon_vma_name *anon_name;
#endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef __HAVE_PFNMAP_TRACKING
+ struct pfnmap_track_ctx *pfnmap_track_ctx;
+#endif
} __randomize_layout;
#ifdef CONFIG_NUMA
diff --git a/mm/memory.c b/mm/memory.c
index 064fc55d8eab9..4cf4adb0de266 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1371,7 +1371,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
struct mm_struct *dst_mm = dst_vma->vm_mm;
struct mm_struct *src_mm = src_vma->vm_mm;
struct mmu_notifier_range range;
- unsigned long next, pfn = 0;
+ unsigned long next;
bool is_cow;
int ret;
@@ -1381,12 +1381,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
if (is_vm_hugetlb_page(src_vma))
return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
- if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
- ret = track_pfn_copy(dst_vma, src_vma, &pfn);
- if (ret)
- return ret;
- }
-
/*
* We need to invalidate the secondary MMU mappings only when
* there could be a permission downgrade on the ptes of the
@@ -1428,8 +1422,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
raw_write_seqcount_end(&src_mm->write_protect_seq);
mmu_notifier_invalidate_range_end(&range);
}
- if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
- untrack_pfn_copy(dst_vma, pfn);
return ret;
}
@@ -1924,9 +1916,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
if (vma->vm_file)
uprobe_munmap(vma, start, end);
- if (unlikely(vma->vm_flags & VM_PFNMAP))
- untrack_pfn(vma, 0, 0, mm_wr_locked);
-
if (start != end) {
if (unlikely(is_vm_hugetlb_page(vma))) {
/*
@@ -2872,6 +2861,36 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
return error;
}
+#ifdef __HAVE_PFNMAP_TRACKING
+static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
+ unsigned long size, pgprot_t *prot)
+{
+ struct pfnmap_track_ctx *ctx;
+
+ if (pfnmap_track(pfn, size, prot))
+ return ERR_PTR(-EINVAL);
+
+ ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+ if (unlikely(!ctx)) {
+ pfnmap_untrack(pfn, size);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ ctx->pfn = pfn;
+ ctx->size = size;
+ kref_init(&ctx->kref);
+ return ctx;
+}
+
+void pfnmap_track_ctx_release(struct kref *ref)
+{
+ struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref);
+
+ pfnmap_untrack(ctx->pfn, ctx->size);
+ kfree(ctx);
+}
+#endif /* __HAVE_PFNMAP_TRACKING */
+
/**
* remap_pfn_range - remap kernel memory to userspace
* @vma: user vma to map to
@@ -2884,20 +2903,51 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
*
* Return: %0 on success, negative error code otherwise.
*/
+#ifdef __HAVE_PFNMAP_TRACKING
int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t prot)
{
+ struct pfnmap_track_ctx *ctx = NULL;
int err;
- err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
- if (err)
+ size = PAGE_ALIGN(size);
+
+ /*
+ * If we cover the full VMA, we'll perform actual tracking, and
+ * remember to untrack when the last reference to our tracking
+ * context from a VMA goes away. We'll keep tracking the whole pfn
+ * range even during VMA splits and partial unmapping.
+ *
+ * If we only cover parts of the VMA, we'll only setup the cachemode
+ * in the pgprot for the pfn range.
+ */
+ if (addr == vma->vm_start && addr + size == vma->vm_end) {
+ if (vma->pfnmap_track_ctx)
+ return -EINVAL;
+ ctx = pfnmap_track_ctx_alloc(pfn, size, &prot);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+ } else if (pfnmap_setup_cachemode(pfn, size, &prot)) {
return -EINVAL;
+ }
err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
- if (err)
- untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
+ if (ctx) {
+ if (err)
+ kref_put(&ctx->kref, pfnmap_track_ctx_release);
+ else
+ vma->pfnmap_track_ctx = ctx;
+ }
return err;
}
+
+#else
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+ return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
+}
+#endif
EXPORT_SYMBOL(remap_pfn_range);
/**
diff --git a/mm/mmap.c b/mm/mmap.c
index 50f902c08341a..09c563c951123 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1784,11 +1784,6 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
tmp = vm_area_dup(mpnt);
if (!tmp)
goto fail_nomem;
-
- /* track_pfn_copy() will later take care of copying internal state. */
- if (unlikely(tmp->vm_flags & VM_PFNMAP))
- untrack_pfn_clear(tmp);
-
retval = vma_dup_policy(mpnt, tmp);
if (retval)
goto fail_nomem_policy;
diff --git a/mm/mremap.c b/mm/mremap.c
index 7db9da609c84f..6e78e02f74bd3 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1191,10 +1191,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
if (is_vm_hugetlb_page(vma))
clear_vma_resv_huge_pages(vma);
- /* Tell pfnmap has moved from this vma */
- if (unlikely(vma->vm_flags & VM_PFNMAP))
- untrack_pfn_clear(vma);
-
*new_vma_ptr = new_vma;
return err;
}
diff --git a/mm/vma_init.c b/mm/vma_init.c
index 967ca85179864..8e53c7943561e 100644
--- a/mm/vma_init.c
+++ b/mm/vma_init.c
@@ -71,7 +71,51 @@ static void vm_area_init_from(const struct vm_area_struct *src,
#ifdef CONFIG_NUMA
dest->vm_policy = src->vm_policy;
#endif
+#ifdef __HAVE_PFNMAP_TRACKING
+ dest->pfnmap_track_ctx = NULL;
+#endif
+}
+
+#ifdef __HAVE_PFNMAP_TRACKING
+static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
+ struct vm_area_struct *new)
+{
+ struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx;
+
+ if (likely(!ctx))
+ return 0;
+
+ /*
+ * We don't expect to ever hit this. If ever required, we would have
+ * to duplicate the tracking.
+ */
+ if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX))
+ return -ENOMEM;
+ kref_get(&ctx->kref);
+ new->pfnmap_track_ctx = ctx;
+ return 0;
+}
+
+static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
+{
+ struct pfnmap_track_ctx *ctx = vma->pfnmap_track_ctx;
+
+ if (likely(!ctx))
+ return;
+
+ kref_put(&ctx->kref, pfnmap_track_ctx_release);
+ vma->pfnmap_track_ctx = NULL;
+}
+#else
+static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
+ struct vm_area_struct *new)
+{
+ return 0;
}
+static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
+{
+}
+#endif
struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
{
@@ -83,6 +127,11 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
vm_area_init_from(orig, new);
+
+ if (vma_pfnmap_track_ctx_dup(orig, new)) {
+ kmem_cache_free(vm_area_cachep, new);
+ return NULL;
+ }
vma_lock_init(new, true);
INIT_LIST_HEAD(&new->anon_vma_chain);
vma_numab_state_init(new);
@@ -97,5 +146,6 @@ void vm_area_free(struct vm_area_struct *vma)
vma_assert_detached(vma);
vma_numab_state_free(vma);
free_anon_vma_name(vma);
+ vma_pfnmap_track_ctx_release(vma);
kmem_cache_free(vm_area_cachep, vma);
}
--
2.49.0
* [PATCH v2 05/11] x86/mm/pat: remove old pfnmap tracking interface
2025-05-12 12:34 [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
` (3 preceding siblings ...)
2025-05-12 12:34 ` [PATCH v2 04/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack() David Hildenbrand
@ 2025-05-12 12:34 ` David Hildenbrand
2025-05-13 17:42 ` Liam R. Howlett
2025-05-12 12:34 ` [PATCH v2 06/11] mm: remove VM_PAT David Hildenbrand
` (6 subsequent siblings)
11 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-05-12 12:34 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu, Ingo Molnar
We can now get rid of the old interface along with get_pat_info() and
follow_phys().
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
Signed-off-by: David Hildenbrand <david@redhat.com>
---
arch/x86/mm/pat/memtype.c | 147 --------------------------------------
include/linux/pgtable.h | 66 -----------------
2 files changed, 213 deletions(-)
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 1ec8af6cad6bf..c88d1cbdc1de1 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -933,119 +933,6 @@ static void free_pfn_range(u64 paddr, unsigned long size)
memtype_free(paddr, paddr + size);
}
-static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
- resource_size_t *phys)
-{
- struct follow_pfnmap_args args = { .vma = vma, .address = vma->vm_start };
-
- if (follow_pfnmap_start(&args))
- return -EINVAL;
-
- /* Never return PFNs of anon folios in COW mappings. */
- if (!args.special) {
- follow_pfnmap_end(&args);
- return -EINVAL;
- }
-
- *prot = pgprot_val(args.pgprot);
- *phys = (resource_size_t)args.pfn << PAGE_SHIFT;
- follow_pfnmap_end(&args);
- return 0;
-}
-
-static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
- pgprot_t *pgprot)
-{
- unsigned long prot;
-
- VM_WARN_ON_ONCE(!(vma->vm_flags & VM_PAT));
-
- /*
- * We need the starting PFN and cachemode used for track_pfn_remap()
- * that covered the whole VMA. For most mappings, we can obtain that
- * information from the page tables. For COW mappings, we might now
- * suddenly have anon folios mapped and follow_phys() will fail.
- *
- * Fallback to using vma->vm_pgoff, see remap_pfn_range_notrack(), to
- * detect the PFN. If we need the cachemode as well, we're out of luck
- * for now and have to fail fork().
- */
- if (!follow_phys(vma, &prot, paddr)) {
- if (pgprot)
- *pgprot = __pgprot(prot);
- return 0;
- }
- if (is_cow_mapping(vma->vm_flags)) {
- if (pgprot)
- return -EINVAL;
- *paddr = (resource_size_t)vma->vm_pgoff << PAGE_SHIFT;
- return 0;
- }
- WARN_ON_ONCE(1);
- return -EINVAL;
-}
-
-int track_pfn_copy(struct vm_area_struct *dst_vma,
- struct vm_area_struct *src_vma, unsigned long *pfn)
-{
- const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
- resource_size_t paddr;
- pgprot_t pgprot;
- int rc;
-
- if (!(src_vma->vm_flags & VM_PAT))
- return 0;
-
- /*
- * Duplicate the PAT information for the dst VMA based on the src
- * VMA.
- */
- if (get_pat_info(src_vma, &paddr, &pgprot))
- return -EINVAL;
- rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
- if (rc)
- return rc;
-
- /* Reservation for the destination VMA succeeded. */
- vm_flags_set(dst_vma, VM_PAT);
- *pfn = PHYS_PFN(paddr);
- return 0;
-}
-
-void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn)
-{
- untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
- /*
- * Reservation was freed, any copied page tables will get cleaned
- * up later, but without getting PAT involved again.
- */
-}
-
-/*
- * prot is passed in as a parameter for the new mapping. If the vma has
- * a linear pfn mapping for the entire range, or no vma is provided,
- * reserve the entire pfn + size range with single reserve_pfn_range
- * call.
- */
-int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
- unsigned long pfn, unsigned long addr, unsigned long size)
-{
- resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
-
- /* reserve the whole chunk starting from paddr */
- if (!vma || (addr == vma->vm_start
- && size == (vma->vm_end - vma->vm_start))) {
- int ret;
-
- ret = reserve_pfn_range(paddr, size, prot, 0);
- if (ret == 0 && vma)
- vm_flags_set(vma, VM_PAT);
- return ret;
- }
-
- return pfnmap_setup_cachemode(pfn, size, prot);
-}
-
int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, pgprot_t *prot)
{
resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
@@ -1082,40 +969,6 @@ void pfnmap_untrack(unsigned long pfn, unsigned long size)
free_pfn_range(paddr, size);
}
-/*
- * untrack_pfn is called while unmapping a pfnmap for a region.
- * untrack can be called for a specific region indicated by pfn and size or
- * can be for the entire vma (in which case pfn, size are zero).
- */
-void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
- unsigned long size, bool mm_wr_locked)
-{
- resource_size_t paddr;
-
- if (vma && !(vma->vm_flags & VM_PAT))
- return;
-
- /* free the chunk starting from pfn or the whole chunk */
- paddr = (resource_size_t)pfn << PAGE_SHIFT;
- if (!paddr && !size) {
- if (get_pat_info(vma, &paddr, NULL))
- return;
- size = vma->vm_end - vma->vm_start;
- }
- free_pfn_range(paddr, size);
- if (vma) {
- if (mm_wr_locked)
- vm_flags_clear(vma, VM_PAT);
- else
- __vm_flags_mod(vma, 0, VM_PAT);
- }
-}
-
-void untrack_pfn_clear(struct vm_area_struct *vma)
-{
- vm_flags_clear(vma, VM_PAT);
-}
-
pgprot_t pgprot_writecombine(pgprot_t prot)
{
pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WC);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 90f72cd358390..0b6e1f781d86d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1485,17 +1485,6 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
* vmf_insert_pfn.
*/
-/*
- * track_pfn_remap is called when a _new_ pfn mapping is being established
- * by remap_pfn_range() for physical range indicated by pfn and size.
- */
-static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
- unsigned long pfn, unsigned long addr,
- unsigned long size)
-{
- return 0;
-}
-
static inline int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size,
pgprot_t *prot)
{
@@ -1511,55 +1500,7 @@ static inline int pfnmap_track(unsigned long pfn, unsigned long size,
static inline void pfnmap_untrack(unsigned long pfn, unsigned long size)
{
}
-
-/*
- * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
- * tables copied during copy_page_range(). Will store the pfn to be
- * passed to untrack_pfn_copy() only if there is something to be untracked.
- * Callers should initialize the pfn to 0.
- */
-static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
- struct vm_area_struct *src_vma, unsigned long *pfn)
-{
- return 0;
-}
-
-/*
- * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
- * copy_page_range(), but after track_pfn_copy() was already called. Can
- * be called even if track_pfn_copy() did not actually track anything:
- * handled internally.
- */
-static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
- unsigned long pfn)
-{
-}
-
-/*
- * untrack_pfn is called while unmapping a pfnmap for a region.
- * untrack can be called for a specific region indicated by pfn and size or
- * can be for the entire vma (in which case pfn, size are zero).
- */
-static inline void untrack_pfn(struct vm_area_struct *vma,
- unsigned long pfn, unsigned long size,
- bool mm_wr_locked)
-{
-}
-
-/*
- * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
- *
- * 1) During mremap() on the src VMA after the page tables were moved.
- * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
- */
-static inline void untrack_pfn_clear(struct vm_area_struct *vma)
-{
-}
#else
-extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
- unsigned long pfn, unsigned long addr,
- unsigned long size);
-
/**
* pfnmap_setup_cachemode - setup the cachemode in the pgprot for a pfn range
* @pfn: the start of the pfn range
@@ -1614,13 +1555,6 @@ int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot);
* Untrack a pfn range previously tracked through pfnmap_track().
*/
void pfnmap_untrack(unsigned long pfn, unsigned long size);
-extern int track_pfn_copy(struct vm_area_struct *dst_vma,
- struct vm_area_struct *src_vma, unsigned long *pfn);
-extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
- unsigned long pfn);
-extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
- unsigned long size, bool mm_wr_locked);
-extern void untrack_pfn_clear(struct vm_area_struct *vma);
#endif
/**
--
2.49.0
* [PATCH v2 06/11] mm: remove VM_PAT
2025-05-12 12:34 [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
` (4 preceding siblings ...)
2025-05-12 12:34 ` [PATCH v2 05/11] x86/mm/pat: remove old pfnmap tracking interface David Hildenbrand
@ 2025-05-12 12:34 ` David Hildenbrand
2025-05-13 17:42 ` Liam R. Howlett
2025-05-12 12:34 ` [PATCH v2 07/11] x86/mm/pat: remove strict_prot parameter from reserve_pfn_range() David Hildenbrand
` (5 subsequent siblings)
11 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-05-12 12:34 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu, Ingo Molnar
It's unused, so let's remove it.
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/mm.h | 4 +---
include/trace/events/mmflags.h | 4 +---
2 files changed, 2 insertions(+), 6 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 38e16c984b9a6..c4efa9b17655e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -357,9 +357,7 @@ extern unsigned int kobjsize(const void *objp);
# define VM_SHADOW_STACK VM_NONE
#endif
-#if defined(CONFIG_X86)
-# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
-#elif defined(CONFIG_PPC64)
+#if defined(CONFIG_PPC64)
# define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */
#elif defined(CONFIG_PARISC)
# define VM_GROWSUP VM_ARCH_1
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 15aae955a10bf..aa441f593e9a6 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -172,9 +172,7 @@ IF_HAVE_PG_ARCH_3(arch_3)
__def_pageflag_names \
) : "none"
-#if defined(CONFIG_X86)
-#define __VM_ARCH_SPECIFIC_1 {VM_PAT, "pat" }
-#elif defined(CONFIG_PPC64)
+#if defined(CONFIG_PPC64)
#define __VM_ARCH_SPECIFIC_1 {VM_SAO, "sao" }
#elif defined(CONFIG_PARISC)
#define __VM_ARCH_SPECIFIC_1 {VM_GROWSUP, "growsup" }
--
2.49.0
* [PATCH v2 07/11] x86/mm/pat: remove strict_prot parameter from reserve_pfn_range()
2025-05-12 12:34 [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
` (5 preceding siblings ...)
2025-05-12 12:34 ` [PATCH v2 06/11] mm: remove VM_PAT David Hildenbrand
@ 2025-05-12 12:34 ` David Hildenbrand
2025-05-13 17:43 ` Liam R. Howlett
2025-05-12 12:34 ` [PATCH v2 08/11] x86/mm/pat: remove MEMTYPE_*_MATCH David Hildenbrand
` (4 subsequent siblings)
11 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-05-12 12:34 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu, Ingo Molnar
Always set to 0, so let's remove it.
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
Signed-off-by: David Hildenbrand <david@redhat.com>
---
arch/x86/mm/pat/memtype.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index c88d1cbdc1de1..ccc55c00b4c8b 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -858,8 +858,7 @@ int memtype_kernel_map_sync(u64 base, unsigned long size,
* Reserved non RAM regions only and after successful memtype_reserve,
* this func also keeps identity mapping (if any) in sync with this new prot.
*/
-static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
- int strict_prot)
+static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot)
{
int is_ram = 0;
int ret;
@@ -895,8 +894,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
return ret;
if (pcm != want_pcm) {
- if (strict_prot ||
- !is_new_memtype_allowed(paddr, size, want_pcm, pcm)) {
+ if (!is_new_memtype_allowed(paddr, size, want_pcm, pcm)) {
memtype_free(paddr, paddr + size);
pr_err("x86/PAT: %s:%d map pfn expected mapping type %s for [mem %#010Lx-%#010Lx], got %s\n",
current->comm, current->pid,
@@ -906,10 +904,6 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
cattr_name(pcm));
return -EINVAL;
}
- /*
- * We allow returning different type than the one requested in
- * non strict case.
- */
pgprot_set_cachemode(vma_prot, pcm);
}
@@ -959,7 +953,7 @@ int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot)
{
const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
- return reserve_pfn_range(paddr, size, prot, 0);
+ return reserve_pfn_range(paddr, size, prot);
}
void pfnmap_untrack(unsigned long pfn, unsigned long size)
--
2.49.0
* [PATCH v2 08/11] x86/mm/pat: remove MEMTYPE_*_MATCH
2025-05-12 12:34 [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
` (6 preceding siblings ...)
2025-05-12 12:34 ` [PATCH v2 07/11] x86/mm/pat: remove strict_prot parameter from reserve_pfn_range() David Hildenbrand
@ 2025-05-12 12:34 ` David Hildenbrand
2025-05-13 17:48 ` Liam R. Howlett
2025-05-12 12:34 ` [PATCH v2 09/11] x86/mm/pat: inline memtype_match() into memtype_erase() David Hildenbrand
` (3 subsequent siblings)
11 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-05-12 12:34 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu, Ingo Molnar
The "memramp() shrinking" scenario no longer applies, so let's remove
that now-unnecessary handling.
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
Signed-off-by: David Hildenbrand <david@redhat.com>
---
arch/x86/mm/pat/memtype_interval.c | 44 ++++--------------------------
1 file changed, 6 insertions(+), 38 deletions(-)
diff --git a/arch/x86/mm/pat/memtype_interval.c b/arch/x86/mm/pat/memtype_interval.c
index 645613d59942a..9d03f0dbc4715 100644
--- a/arch/x86/mm/pat/memtype_interval.c
+++ b/arch/x86/mm/pat/memtype_interval.c
@@ -49,26 +49,15 @@ INTERVAL_TREE_DEFINE(struct memtype, rb, u64, subtree_max_end,
static struct rb_root_cached memtype_rbroot = RB_ROOT_CACHED;
-enum {
- MEMTYPE_EXACT_MATCH = 0,
- MEMTYPE_END_MATCH = 1
-};
-
-static struct memtype *memtype_match(u64 start, u64 end, int match_type)
+static struct memtype *memtype_match(u64 start, u64 end)
{
struct memtype *entry_match;
entry_match = interval_iter_first(&memtype_rbroot, start, end-1);
while (entry_match != NULL && entry_match->start < end) {
- if ((match_type == MEMTYPE_EXACT_MATCH) &&
- (entry_match->start == start) && (entry_match->end == end))
- return entry_match;
-
- if ((match_type == MEMTYPE_END_MATCH) &&
- (entry_match->start < start) && (entry_match->end == end))
+ if (entry_match->start == start && entry_match->end == end)
return entry_match;
-
entry_match = interval_iter_next(entry_match, start, end-1);
}
@@ -132,32 +121,11 @@ struct memtype *memtype_erase(u64 start, u64 end)
{
struct memtype *entry_old;
- /*
- * Since the memtype_rbroot tree allows overlapping ranges,
- * memtype_erase() checks with EXACT_MATCH first, i.e. free
- * a whole node for the munmap case. If no such entry is found,
- * it then checks with END_MATCH, i.e. shrink the size of a node
- * from the end for the mremap case.
- */
- entry_old = memtype_match(start, end, MEMTYPE_EXACT_MATCH);
- if (!entry_old) {
- entry_old = memtype_match(start, end, MEMTYPE_END_MATCH);
- if (!entry_old)
- return ERR_PTR(-EINVAL);
- }
-
- if (entry_old->start == start) {
- /* munmap: erase this node */
- interval_remove(entry_old, &memtype_rbroot);
- } else {
- /* mremap: update the end value of this node */
- interval_remove(entry_old, &memtype_rbroot);
- entry_old->end = start;
- interval_insert(entry_old, &memtype_rbroot);
-
- return NULL;
- }
+ entry_old = memtype_match(start, end);
+ if (!entry_old)
+ return ERR_PTR(-EINVAL);
+ interval_remove(entry_old, &memtype_rbroot);
return entry_old;
}
--
2.49.0
* [PATCH v2 09/11] x86/mm/pat: inline memtype_match() into memtype_erase()
2025-05-12 12:34 [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
` (7 preceding siblings ...)
2025-05-12 12:34 ` [PATCH v2 08/11] x86/mm/pat: remove MEMTYPE_*_MATCH David Hildenbrand
@ 2025-05-12 12:34 ` David Hildenbrand
2025-05-12 16:49 ` Lorenzo Stoakes
2025-05-13 17:49 ` Liam R. Howlett
2025-05-12 12:34 ` [PATCH v2 10/11] drm/i915: track_pfn() -> "pfnmap tracking" David Hildenbrand
` (2 subsequent siblings)
11 siblings, 2 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-05-12 12:34 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu
Let's just have it in a single function. The resulting function is
certainly small enough and readable.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
arch/x86/mm/pat/memtype_interval.c | 33 +++++++++---------------------
1 file changed, 10 insertions(+), 23 deletions(-)
diff --git a/arch/x86/mm/pat/memtype_interval.c b/arch/x86/mm/pat/memtype_interval.c
index 9d03f0dbc4715..e5844ed1311ed 100644
--- a/arch/x86/mm/pat/memtype_interval.c
+++ b/arch/x86/mm/pat/memtype_interval.c
@@ -49,21 +49,6 @@ INTERVAL_TREE_DEFINE(struct memtype, rb, u64, subtree_max_end,
static struct rb_root_cached memtype_rbroot = RB_ROOT_CACHED;
-static struct memtype *memtype_match(u64 start, u64 end)
-{
- struct memtype *entry_match;
-
- entry_match = interval_iter_first(&memtype_rbroot, start, end-1);
-
- while (entry_match != NULL && entry_match->start < end) {
- if (entry_match->start == start && entry_match->end == end)
- return entry_match;
- entry_match = interval_iter_next(entry_match, start, end-1);
- }
-
- return NULL; /* Returns NULL if there is no match */
-}
-
static int memtype_check_conflict(u64 start, u64 end,
enum page_cache_mode reqtype,
enum page_cache_mode *newtype)
@@ -119,14 +104,16 @@ int memtype_check_insert(struct memtype *entry_new, enum page_cache_mode *ret_ty
struct memtype *memtype_erase(u64 start, u64 end)
{
- struct memtype *entry_old;
-
- entry_old = memtype_match(start, end);
- if (!entry_old)
- return ERR_PTR(-EINVAL);
-
- interval_remove(entry_old, &memtype_rbroot);
- return entry_old;
+ struct memtype *entry = interval_iter_first(&memtype_rbroot, start, end - 1);
+
+ while (entry && entry->start < end) {
+ if (entry->start == start && entry->end == end) {
+ interval_remove(entry, &memtype_rbroot);
+ return entry;
+ }
+ entry = interval_iter_next(entry, start, end - 1);
+ }
+ return ERR_PTR(-EINVAL);
}
struct memtype *memtype_lookup(u64 addr)
--
2.49.0
* [PATCH v2 10/11] drm/i915: track_pfn() -> "pfnmap tracking"
2025-05-12 12:34 [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
` (8 preceding siblings ...)
2025-05-12 12:34 ` [PATCH v2 09/11] x86/mm/pat: inline memtype_match() into memtype_erase() David Hildenbrand
@ 2025-05-12 12:34 ` David Hildenbrand
2025-05-13 17:50 ` Liam R. Howlett
2025-05-12 12:34 ` [PATCH v2 11/11] mm/io-mapping: " David Hildenbrand
2025-05-13 15:53 ` [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT Liam R. Howlett
11 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-05-12 12:34 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu, Ingo Molnar
track_pfn() does not exist; let's simply refer to it as "pfnmap
tracking".
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
Signed-off-by: David Hildenbrand <david@redhat.com>
---
drivers/gpu/drm/i915/i915_mm.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_mm.c b/drivers/gpu/drm/i915/i915_mm.c
index 76e2801619f09..c33bd3d830699 100644
--- a/drivers/gpu/drm/i915/i915_mm.c
+++ b/drivers/gpu/drm/i915/i915_mm.c
@@ -100,7 +100,7 @@ int remap_io_mapping(struct vm_area_struct *vma,
GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS);
- /* We rely on prevalidation of the io-mapping to skip track_pfn(). */
+ /* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
r.mm = vma->vm_mm;
r.pfn = pfn;
r.prot = __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) |
@@ -140,7 +140,7 @@ int remap_io_sg(struct vm_area_struct *vma,
};
int err;
- /* We rely on prevalidation of the io-mapping to skip track_pfn(). */
+ /* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS);
while (offset >= r.sgt.max >> PAGE_SHIFT) {
--
2.49.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v2 11/11] mm/io-mapping: track_pfn() -> "pfnmap tracking"
2025-05-12 12:34 [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
` (9 preceding siblings ...)
2025-05-12 12:34 ` [PATCH v2 10/11] drm/i915: track_pfn() -> "pfnmap tracking" David Hildenbrand
@ 2025-05-12 12:34 ` David Hildenbrand
2025-05-13 17:50 ` Liam R. Howlett
2025-05-13 15:53 ` [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT Liam R. Howlett
11 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-05-12 12:34 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, x86, intel-gfx, dri-devel, linux-trace-kernel,
David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu, Ingo Molnar
track_pfn() does not exist, let's simply refer to it as "pfnmap
tracking".
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/io-mapping.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/io-mapping.c b/mm/io-mapping.c
index f44a6a1347123..d3586e95c12c5 100644
--- a/mm/io-mapping.c
+++ b/mm/io-mapping.c
@@ -24,7 +24,7 @@ int io_mapping_map_user(struct io_mapping *iomap, struct vm_area_struct *vma,
pgprot_t remap_prot = __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) |
(pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK));
- /* We rely on prevalidation of the io-mapping to skip track_pfn(). */
+ /* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
return remap_pfn_range_notrack(vma, addr, pfn, size, remap_prot);
}
EXPORT_SYMBOL_GPL(io_mapping_map_user);
--
2.49.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH v2 02/11] mm: convert track_pfn_insert() to pfnmap_setup_cachemode*()
2025-05-12 12:34 ` [PATCH v2 02/11] mm: convert track_pfn_insert() to pfnmap_setup_cachemode*() David Hildenbrand
@ 2025-05-12 15:43 ` Lorenzo Stoakes
2025-05-13 9:06 ` David Hildenbrand
2025-05-13 17:29 ` Liam R. Howlett
1 sibling, 1 reply; 36+ messages in thread
From: Lorenzo Stoakes @ 2025-05-12 15:43 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
On Mon, May 12, 2025 at 02:34:15PM +0200, David Hildenbrand wrote:
> ... by factoring it out from track_pfn_remap() into
> pfnmap_setup_cachemode() and provide pfnmap_setup_cachemode_pfn() as
> a replacement for track_pfn_insert().
>
> For PMDs/PUDs, we keep checking a single pfn only. Add some documentation,
> and also document why it is valid to not check the whole pfn range.
>
> We'll reuse pfnmap_setup_cachemode() from core MM next.
>
> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
> Signed-off-by: David Hildenbrand <david@redhat.com>
I've gone through carefully and checked and this looks good to me :)
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
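For anyone following the single-pfn reasoning in the doc comment below: a
hypothetical driver fault handler (all names made up) ends up in this path
via vmf_insert_pfn(), which after this patch looks up the cachemode for
exactly the one pfn being inserted:

/*
 * Hypothetical sketch, not from the series: a driver that validated its
 * whole BAR to have a single cachemode at mmap time inserts one pfn per
 * fault; vmf_insert_pfn() calls pfnmap_setup_cachemode_pfn() on just
 * that pfn.
 */
static vm_fault_t mydev_vm_fault(struct vm_fault *vmf)
{
	struct mydev *mydev = vmf->vma->vm_private_data; /* made up */

	return vmf_insert_pfn(vmf->vma, vmf->address,
			      mydev->base_pfn + vmf->pgoff);
}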
> ---
> arch/x86/mm/pat/memtype.c | 24 ++++++------------
> include/linux/pgtable.h | 52 +++++++++++++++++++++++++++++++++------
> mm/huge_memory.c | 5 ++--
> mm/memory.c | 4 +--
> 4 files changed, 57 insertions(+), 28 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index edec5859651d6..fa78facc6f633 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -1031,7 +1031,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> unsigned long pfn, unsigned long addr, unsigned long size)
> {
> resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> - enum page_cache_mode pcm;
>
> /* reserve the whole chunk starting from paddr */
> if (!vma || (addr == vma->vm_start
> @@ -1044,13 +1043,17 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> return ret;
> }
>
> + return pfnmap_setup_cachemode(pfn, size, prot);
> +}
> +
> +int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, pgprot_t *prot)
> +{
> + resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> + enum page_cache_mode pcm;
> +
> if (!pat_enabled())
> return 0;
>
> - /*
> - * For anything smaller than the vma size we set prot based on the
> - * lookup.
> - */
> pcm = lookup_memtype(paddr);
>
> /* Check memtype for the remaining pages */
> @@ -1065,17 +1068,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> return 0;
> }
>
> -void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
> -{
> - enum page_cache_mode pcm;
> -
> - if (!pat_enabled())
> - return;
> -
> - pcm = lookup_memtype(pfn_t_to_phys(pfn));
> - pgprot_set_cachemode(prot, pcm);
> -}
> -
> /*
> * untrack_pfn is called while unmapping a pfnmap for a region.
> * untrack can be called for a specific region indicated by pfn and size or
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index f1e890b604609..be1745839871c 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1496,13 +1496,10 @@ static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> return 0;
> }
>
> -/*
> - * track_pfn_insert is called when a _new_ single pfn is established
> - * by vmf_insert_pfn().
> - */
> -static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> - pfn_t pfn)
> +static inline int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size,
> + pgprot_t *prot)
> {
> + return 0;
> }
>
> /*
> @@ -1552,8 +1549,32 @@ static inline void untrack_pfn_clear(struct vm_area_struct *vma)
> extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> unsigned long pfn, unsigned long addr,
> unsigned long size);
> -extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> - pfn_t pfn);
> +
> +/**
> + * pfnmap_setup_cachemode - setup the cachemode in the pgprot for a pfn range
> + * @pfn: the start of the pfn range
> + * @size: the size of the pfn range in bytes
> + * @prot: the pgprot to modify
> + *
> + * Lookup the cachemode for the pfn range starting at @pfn with the size
> + * @size and store it in @prot, leaving other data in @prot unchanged.
> + *
> + * This allows for a hardware implementation to have fine-grained control of
> + * memory cache behavior at page level granularity. Without a hardware
> + * implementation, this function does nothing.
> + *
> + * Currently there is only one implementation for this - x86 Page Attribute
> + * Table (PAT). See Documentation/arch/x86/pat.rst for more details.
> + *
> + * This function can fail if the pfn range spans pfns that require differing
> + * cachemodes. If the pfn range was previously verified to have a single
> + * cachemode, it is sufficient to query only a single pfn. The assumption is
> + * that this is the case for drivers using the vmf_insert_pfn*() interface.
> + *
> + * Returns 0 on success and -EINVAL on error.
> + */
> +int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size,
> + pgprot_t *prot);
> extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> struct vm_area_struct *src_vma, unsigned long *pfn);
> extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> @@ -1563,6 +1584,21 @@ extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> extern void untrack_pfn_clear(struct vm_area_struct *vma);
> #endif
>
> +/**
> + * pfnmap_setup_cachemode_pfn - setup the cachemode in the pgprot for a pfn
> + * @pfn: the pfn
> + * @prot: the pgprot to modify
> + *
> + * Lookup the cachemode for @pfn and store it in @prot, leaving other
> + * data in @prot unchanged.
> + *
> + * See pfnmap_setup_cachemode() for details.
> + */
> +static inline void pfnmap_setup_cachemode_pfn(unsigned long pfn, pgprot_t *prot)
> +{
> + pfnmap_setup_cachemode(pfn, PAGE_SIZE, prot);
> +}
> +
> #ifdef CONFIG_MMU
> #ifdef __HAVE_COLOR_ZERO_PAGE
> static inline int is_zero_pfn(unsigned long pfn)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2780a12b25f01..d3e66136e41a3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1455,7 +1455,8 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> return VM_FAULT_OOM;
> }
>
> - track_pfn_insert(vma, &pgprot, pfn);
> + pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
> +
> ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write,
> pgtable);
> @@ -1577,7 +1578,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
> if (addr < vma->vm_start || addr >= vma->vm_end)
> return VM_FAULT_SIGBUS;
>
> - track_pfn_insert(vma, &pgprot, pfn);
> + pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
>
> ptl = pud_lock(vma->vm_mm, vmf->pud);
> insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
> diff --git a/mm/memory.c b/mm/memory.c
> index 99af83434e7c5..064fc55d8eab9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2564,7 +2564,7 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
> if (!pfn_modify_allowed(pfn, pgprot))
> return VM_FAULT_SIGBUS;
>
> - track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
> + pfnmap_setup_cachemode_pfn(pfn, &pgprot);
>
> return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
> false);
> @@ -2627,7 +2627,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
> if (addr < vma->vm_start || addr >= vma->vm_end)
> return VM_FAULT_SIGBUS;
>
> - track_pfn_insert(vma, &pgprot, pfn);
> + pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
>
> if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
> return VM_FAULT_SIGBUS;
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 04/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
2025-05-12 12:34 ` [PATCH v2 04/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack() David Hildenbrand
@ 2025-05-12 16:42 ` Lorenzo Stoakes
2025-05-13 9:10 ` David Hildenbrand
2025-05-13 17:42 ` Liam R. Howlett
1 sibling, 1 reply; 36+ messages in thread
From: Lorenzo Stoakes @ 2025-05-12 16:42 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
On Mon, May 12, 2025 at 02:34:17PM +0200, David Hildenbrand wrote:
> Let's use our new interface. In remap_pfn_range(), we'll now decide
> whether we have to track (full VMA covered) or only lookup the
> cachemode (partial VMA covered).
>
> Remember what we have to untrack by linking it from the VMA. When
> duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
> to anon VMA names, and use a kref to share the tracking.
>
> Once the last VMA un-refs our tracking data, we'll do the untracking,
> which simplifies things a lot and should sort out various issues we saw
> recently, for example, when partially unmapping/zapping a tracked VMA.
>
> This change implies that we'll keep tracking the original PFN range even
> after splitting + partially unmapping it: not too bad, because it was
> not working reliably before. The only thing that kind-of worked before
> was shrinking such a mapping using mremap(): we managed to adjust the
> reservation in a hacky way, now we won't adjust the reservation but
> leave it around until all involved VMAs are gone.
>
> If that ever turns out to be an issue, we could hook into VM splitting
> code and split the tracking; however, that adds complexity that might
> not be required, so we'll keep it simple for now.
>
> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
> Signed-off-by: David Hildenbrand <david@redhat.com>
Other than small nit below,
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
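To make the full-VMA vs. partial distinction concrete, a hypothetical mmap
handler (device and field names made up); mapping exactly [vm_start,
vm_end) is what gets a pfnmap_track_ctx attached, while a sub-range would
only get the cachemode lookup:

/*
 * Hypothetical sketch, not from the series: this covers the whole VMA,
 * so remap_pfn_range() performs actual tracking and links the resulting
 * pfnmap_track_ctx from the VMA.
 */
static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct mydev *mydev = file->private_data; /* made up */
	unsigned long size = vma->vm_end - vma->vm_start;

	return remap_pfn_range(vma, vma->vm_start, mydev->base_pfn,
			       size, vma->vm_page_prot);
}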
> ---
> include/linux/mm_inline.h | 2 +
> include/linux/mm_types.h | 11 ++++++
> mm/memory.c | 82 +++++++++++++++++++++++++++++++--------
> mm/mmap.c | 5 ---
> mm/mremap.c | 4 --
> mm/vma_init.c | 50 ++++++++++++++++++++++++
> 6 files changed, 129 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index f9157a0c42a5c..89b518ff097e6 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>
> #endif /* CONFIG_ANON_VMA_NAME */
>
> +void pfnmap_track_ctx_release(struct kref *ref);
> +
> static inline void init_tlb_flush_pending(struct mm_struct *mm)
> {
> atomic_set(&mm->tlb_flush_pending, 0);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 15808cad2bc1a..3e934dc6057c4 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -763,6 +763,14 @@ struct vma_numab_state {
> int prev_scan_seq;
> };
>
> +#ifdef __HAVE_PFNMAP_TRACKING
> +struct pfnmap_track_ctx {
> + struct kref kref;
> + unsigned long pfn;
> + unsigned long size; /* in bytes */
> +};
> +#endif
> +
> /*
> * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
> * manipulate mutable fields which will cause those fields to be updated in the
> @@ -900,6 +908,9 @@ struct vm_area_struct {
> struct anon_vma_name *anon_name;
> #endif
> struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef __HAVE_PFNMAP_TRACKING
> + struct pfnmap_track_ctx *pfnmap_track_ctx;
> +#endif
> } __randomize_layout;
>
> #ifdef CONFIG_NUMA
> diff --git a/mm/memory.c b/mm/memory.c
> index 064fc55d8eab9..4cf4adb0de266 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1371,7 +1371,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> struct mm_struct *dst_mm = dst_vma->vm_mm;
> struct mm_struct *src_mm = src_vma->vm_mm;
> struct mmu_notifier_range range;
> - unsigned long next, pfn = 0;
> + unsigned long next;
> bool is_cow;
> int ret;
>
> @@ -1381,12 +1381,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> if (is_vm_hugetlb_page(src_vma))
> return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
>
> - if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
> - ret = track_pfn_copy(dst_vma, src_vma, &pfn);
> - if (ret)
> - return ret;
> - }
> -
> /*
> * We need to invalidate the secondary MMU mappings only when
> * there could be a permission downgrade on the ptes of the
> @@ -1428,8 +1422,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> raw_write_seqcount_end(&src_mm->write_protect_seq);
> mmu_notifier_invalidate_range_end(&range);
> }
> - if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
> - untrack_pfn_copy(dst_vma, pfn);
> return ret;
> }
>
> @@ -1924,9 +1916,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> if (vma->vm_file)
> uprobe_munmap(vma, start, end);
>
> - if (unlikely(vma->vm_flags & VM_PFNMAP))
> - untrack_pfn(vma, 0, 0, mm_wr_locked);
> -
> if (start != end) {
> if (unlikely(is_vm_hugetlb_page(vma))) {
> /*
> @@ -2872,6 +2861,36 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
> return error;
> }
>
> +#ifdef __HAVE_PFNMAP_TRACKING
> +static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
> + unsigned long size, pgprot_t *prot)
> +{
> + struct pfnmap_track_ctx *ctx;
> +
> + if (pfnmap_track(pfn, size, prot))
> + return ERR_PTR(-EINVAL);
> +
> + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> + if (unlikely(!ctx)) {
> + pfnmap_untrack(pfn, size);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + ctx->pfn = pfn;
> + ctx->size = size;
> + kref_init(&ctx->kref);
> + return ctx;
> +}
> +
> +void pfnmap_track_ctx_release(struct kref *ref)
> +{
> + struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref);
> +
> + pfnmap_untrack(ctx->pfn, ctx->size);
> + kfree(ctx);
> +}
> +#endif /* __HAVE_PFNMAP_TRACKING */
> +
> /**
> * remap_pfn_range - remap kernel memory to userspace
> * @vma: user vma to map to
> @@ -2884,20 +2903,51 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
> *
> * Return: %0 on success, negative error code otherwise.
> */
> +#ifdef __HAVE_PFNMAP_TRACKING
> int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> unsigned long pfn, unsigned long size, pgprot_t prot)
> {
> + struct pfnmap_track_ctx *ctx = NULL;
> int err;
>
> - err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
> - if (err)
> + size = PAGE_ALIGN(size);
> +
> + /*
> + * If we cover the full VMA, we'll perform actual tracking, and
> + * remember to untrack when the last reference to our tracking
> + * context from a VMA goes away. We'll keep tracking the whole pfn
> + * range even during VMA splits and partial unmapping.
> + *
> + * If we only cover parts of the VMA, we'll only setup the cachemode
> + * in the pgprot for the pfn range.
> + */
> + if (addr == vma->vm_start && addr + size == vma->vm_end) {
> + if (vma->pfnmap_track_ctx)
> + return -EINVAL;
> + ctx = pfnmap_track_ctx_alloc(pfn, size, &prot);
> + if (IS_ERR(ctx))
> + return PTR_ERR(ctx);
> + } else if (pfnmap_setup_cachemode(pfn, size, &prot)) {
> return -EINVAL;
> + }
>
> err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> - if (err)
> - untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
> + if (ctx) {
> + if (err)
> + kref_put(&ctx->kref, pfnmap_track_ctx_release);
> + else
> + vma->pfnmap_track_ctx = ctx;
> + }
> return err;
> }
> +
> +#else
> +int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> + unsigned long pfn, unsigned long size, pgprot_t prot)
> +{
> + return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> +}
> +#endif
> EXPORT_SYMBOL(remap_pfn_range);
>
> /**
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 50f902c08341a..09c563c951123 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1784,11 +1784,6 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> tmp = vm_area_dup(mpnt);
> if (!tmp)
> goto fail_nomem;
> -
> - /* track_pfn_copy() will later take care of copying internal state. */
> - if (unlikely(tmp->vm_flags & VM_PFNMAP))
> - untrack_pfn_clear(tmp);
> -
> retval = vma_dup_policy(mpnt, tmp);
> if (retval)
> goto fail_nomem_policy;
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 7db9da609c84f..6e78e02f74bd3 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -1191,10 +1191,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
> if (is_vm_hugetlb_page(vma))
> clear_vma_resv_huge_pages(vma);
>
> - /* Tell pfnmap has moved from this vma */
> - if (unlikely(vma->vm_flags & VM_PFNMAP))
> - untrack_pfn_clear(vma);
> -
> *new_vma_ptr = new_vma;
> return err;
> }
> diff --git a/mm/vma_init.c b/mm/vma_init.c
> index 967ca85179864..8e53c7943561e 100644
> --- a/mm/vma_init.c
> +++ b/mm/vma_init.c
> @@ -71,7 +71,51 @@ static void vm_area_init_from(const struct vm_area_struct *src,
> #ifdef CONFIG_NUMA
> dest->vm_policy = src->vm_policy;
> #endif
> +#ifdef __HAVE_PFNMAP_TRACKING
> + dest->pfnmap_track_ctx = NULL;
> +#endif
> +}
> +
> +#ifdef __HAVE_PFNMAP_TRACKING
> +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> + struct vm_area_struct *new)
> +{
> + struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx;
> +
> + if (likely(!ctx))
> + return 0;
> +
> + /*
> + * We don't expect to ever hit this. If ever required, we would have
> + * to duplicate the tracking.
> + */
> + if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX))
How not expected is this? :) maybe use WARN_ON_ONCE() if it really should
never happen?
> + return -ENOMEM;
> + kref_get(&ctx->kref);
> + new->pfnmap_track_ctx = ctx;
> + return 0;
> +}
> +
> +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
> +{
> + struct pfnmap_track_ctx *ctx = vma->pfnmap_track_ctx;
> +
> + if (likely(!ctx))
> + return;
> +
> + kref_put(&ctx->kref, pfnmap_track_ctx_release);
> + vma->pfnmap_track_ctx = NULL;
> +}
> +#else
> +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> + struct vm_area_struct *new)
> +{
> + return 0;
> }
> +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
> +{
> +}
> +#endif
>
> struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> {
> @@ -83,6 +127,11 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
> ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
> vm_area_init_from(orig, new);
> +
> + if (vma_pfnmap_track_ctx_dup(orig, new)) {
> + kmem_cache_free(vm_area_cachep, new);
> + return NULL;
> + }
> vma_lock_init(new, true);
> INIT_LIST_HEAD(&new->anon_vma_chain);
> vma_numab_state_init(new);
> @@ -97,5 +146,6 @@ void vm_area_free(struct vm_area_struct *vma)
> vma_assert_detached(vma);
> vma_numab_state_free(vma);
> free_anon_vma_name(vma);
> + vma_pfnmap_track_ctx_release(vma);
> kmem_cache_free(vm_area_cachep, vma);
> }
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 09/11] x86/mm/pat: inline memtype_match() into memtype_erase()
2025-05-12 12:34 ` [PATCH v2 09/11] x86/mm/pat: inline memtype_match() into memtype_erase() David Hildenbrand
@ 2025-05-12 16:49 ` Lorenzo Stoakes
2025-05-13 9:11 ` David Hildenbrand
2025-05-13 17:49 ` Liam R. Howlett
1 sibling, 1 reply; 36+ messages in thread
From: Lorenzo Stoakes @ 2025-05-12 16:49 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu
On Mon, May 12, 2025 at 02:34:22PM +0200, David Hildenbrand wrote:
> Let's just have it in a single function. The resulting function is
> certainly small enough and readable.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Nice, great bit of refactoring :) the new version is considerably clearer.
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> arch/x86/mm/pat/memtype_interval.c | 33 +++++++++---------------------
> 1 file changed, 10 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype_interval.c b/arch/x86/mm/pat/memtype_interval.c
> index 9d03f0dbc4715..e5844ed1311ed 100644
> --- a/arch/x86/mm/pat/memtype_interval.c
> +++ b/arch/x86/mm/pat/memtype_interval.c
> @@ -49,21 +49,6 @@ INTERVAL_TREE_DEFINE(struct memtype, rb, u64, subtree_max_end,
>
> static struct rb_root_cached memtype_rbroot = RB_ROOT_CACHED;
>
> -static struct memtype *memtype_match(u64 start, u64 end)
> -{
> - struct memtype *entry_match;
> -
> - entry_match = interval_iter_first(&memtype_rbroot, start, end-1);
> -
> - while (entry_match != NULL && entry_match->start < end) {
> - if (entry_match->start == start && entry_match->end == end)
> - return entry_match;
> - entry_match = interval_iter_next(entry_match, start, end-1);
> - }
> -
> - return NULL; /* Returns NULL if there is no match */
> -}
> -
> static int memtype_check_conflict(u64 start, u64 end,
> enum page_cache_mode reqtype,
> enum page_cache_mode *newtype)
> @@ -119,14 +104,16 @@ int memtype_check_insert(struct memtype *entry_new, enum page_cache_mode *ret_ty
>
> struct memtype *memtype_erase(u64 start, u64 end)
> {
> - struct memtype *entry_old;
> -
> - entry_old = memtype_match(start, end);
> - if (!entry_old)
> - return ERR_PTR(-EINVAL);
> -
> - interval_remove(entry_old, &memtype_rbroot);
> - return entry_old;
> + struct memtype *entry = interval_iter_first(&memtype_rbroot, start, end - 1);
> +
> + while (entry && entry->start < end) {
> + if (entry->start == start && entry->end == end) {
> + interval_remove(entry, &memtype_rbroot);
> + return entry;
> + }
> + entry = interval_iter_next(entry, start, end - 1);
> + }
> + return ERR_PTR(-EINVAL);
> }
>
> struct memtype *memtype_lookup(u64 addr)
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 02/11] mm: convert track_pfn_insert() to pfnmap_setup_cachemode*()
2025-05-12 15:43 ` Lorenzo Stoakes
@ 2025-05-13 9:06 ` David Hildenbrand
0 siblings, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-05-13 9:06 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
On 12.05.25 17:43, Lorenzo Stoakes wrote:
> On Mon, May 12, 2025 at 02:34:15PM +0200, David Hildenbrand wrote:
>> ... by factoring it out from track_pfn_remap() into
>> pfnmap_setup_cachemode() and provide pfnmap_setup_cachemode_pfn() as
>> a replacement for track_pfn_insert().
>>
>> For PMDs/PUDs, we keep checking a single pfn only. Add some documentation,
>> and also document why it is valid to not check the whole pfn range.
>>
>> We'll reuse pfnmap_setup_cachemode() from core MM next.
>>
>> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> I've gone through carefully and checked and this looks good to me :)
Thanks a bunch!
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 04/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
2025-05-12 16:42 ` Lorenzo Stoakes
@ 2025-05-13 9:10 ` David Hildenbrand
2025-05-13 10:16 ` Lorenzo Stoakes
0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-05-13 9:10 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
On 12.05.25 18:42, Lorenzo Stoakes wrote:
> On Mon, May 12, 2025 at 02:34:17PM +0200, David Hildenbrand wrote:
>> Let's use our new interface. In remap_pfn_range(), we'll now decide
>> whether we have to track (full VMA covered) or only lookup the
>> cachemode (partial VMA covered).
>>
>> Remember what we have to untrack by linking it from the VMA. When
>> duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
>> to anon VMA names, and use a kref to share the tracking.
>>
>> Once the last VMA un-refs our tracking data, we'll do the untracking,
>> which simplifies things a lot and should sort out various issues we saw
>> recently, for example, when partially unmapping/zapping a tracked VMA.
>>
>> This change implies that we'll keep tracking the original PFN range even
>> after splitting + partially unmapping it: not too bad, because it was
>> not working reliably before. The only thing that kind-of worked before
>> was shrinking such a mapping using mremap(): we managed to adjust the
>> reservation in a hacky way, now we won't adjust the reservation but
>> leave it around until all involved VMAs are gone.
>>
>> If that ever turns out to be an issue, we could hook into VM splitting
>> code and split the tracking; however, that adds complexity that might
>> not be required, so we'll keep it simple for now.
>>
>> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> Other than small nit below,
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
>> ---
>> include/linux/mm_inline.h | 2 +
>> include/linux/mm_types.h | 11 ++++++
>> mm/memory.c | 82 +++++++++++++++++++++++++++++++--------
>> mm/mmap.c | 5 ---
>> mm/mremap.c | 4 --
>> mm/vma_init.c | 50 ++++++++++++++++++++++++
>> 6 files changed, 129 insertions(+), 25 deletions(-)
>>
>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> index f9157a0c42a5c..89b518ff097e6 100644
>> --- a/include/linux/mm_inline.h
>> +++ b/include/linux/mm_inline.h
>> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>>
>> #endif /* CONFIG_ANON_VMA_NAME */
>>
>> +void pfnmap_track_ctx_release(struct kref *ref);
>> +
>> static inline void init_tlb_flush_pending(struct mm_struct *mm)
>> {
>> atomic_set(&mm->tlb_flush_pending, 0);
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 15808cad2bc1a..3e934dc6057c4 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -763,6 +763,14 @@ struct vma_numab_state {
>> int prev_scan_seq;
>> };
>>
>> +#ifdef __HAVE_PFNMAP_TRACKING
>> +struct pfnmap_track_ctx {
>> + struct kref kref;
>> + unsigned long pfn;
>> + unsigned long size; /* in bytes */
>> +};
>> +#endif
>> +
>> /*
>> * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
>> * manipulate mutable fields which will cause those fields to be updated in the
>> @@ -900,6 +908,9 @@ struct vm_area_struct {
>> struct anon_vma_name *anon_name;
>> #endif
>> struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>> +#ifdef __HAVE_PFNMAP_TRACKING
>> + struct pfnmap_track_ctx *pfnmap_track_ctx;
>> +#endif
>> } __randomize_layout;
>>
>> #ifdef CONFIG_NUMA
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 064fc55d8eab9..4cf4adb0de266 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1371,7 +1371,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>> struct mm_struct *dst_mm = dst_vma->vm_mm;
>> struct mm_struct *src_mm = src_vma->vm_mm;
>> struct mmu_notifier_range range;
>> - unsigned long next, pfn = 0;
>> + unsigned long next;
>> bool is_cow;
>> int ret;
>>
>> @@ -1381,12 +1381,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>> if (is_vm_hugetlb_page(src_vma))
>> return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
>>
>> - if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
>> - ret = track_pfn_copy(dst_vma, src_vma, &pfn);
>> - if (ret)
>> - return ret;
>> - }
>> -
>> /*
>> * We need to invalidate the secondary MMU mappings only when
>> * there could be a permission downgrade on the ptes of the
>> @@ -1428,8 +1422,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>> raw_write_seqcount_end(&src_mm->write_protect_seq);
>> mmu_notifier_invalidate_range_end(&range);
>> }
>> - if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
>> - untrack_pfn_copy(dst_vma, pfn);
>> return ret;
>> }
>>
>> @@ -1924,9 +1916,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>> if (vma->vm_file)
>> uprobe_munmap(vma, start, end);
>>
>> - if (unlikely(vma->vm_flags & VM_PFNMAP))
>> - untrack_pfn(vma, 0, 0, mm_wr_locked);
>> -
>> if (start != end) {
>> if (unlikely(is_vm_hugetlb_page(vma))) {
>> /*
>> @@ -2872,6 +2861,36 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
>> return error;
>> }
>>
>> +#ifdef __HAVE_PFNMAP_TRACKING
>> +static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
>> + unsigned long size, pgprot_t *prot)
>> +{
>> + struct pfnmap_track_ctx *ctx;
>> +
>> + if (pfnmap_track(pfn, size, prot))
>> + return ERR_PTR(-EINVAL);
>> +
>> + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
>> + if (unlikely(!ctx)) {
>> + pfnmap_untrack(pfn, size);
>> + return ERR_PTR(-ENOMEM);
>> + }
>> +
>> + ctx->pfn = pfn;
>> + ctx->size = size;
>> + kref_init(&ctx->kref);
>> + return ctx;
>> +}
>> +
>> +void pfnmap_track_ctx_release(struct kref *ref)
>> +{
>> + struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref);
>> +
>> + pfnmap_untrack(ctx->pfn, ctx->size);
>> + kfree(ctx);
>> +}
>> +#endif /* __HAVE_PFNMAP_TRACKING */
>> +
>> /**
>> * remap_pfn_range - remap kernel memory to userspace
>> * @vma: user vma to map to
>> @@ -2884,20 +2903,51 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
>> *
>> * Return: %0 on success, negative error code otherwise.
>> */
>> +#ifdef __HAVE_PFNMAP_TRACKING
>> int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>> unsigned long pfn, unsigned long size, pgprot_t prot)
>> {
>> + struct pfnmap_track_ctx *ctx = NULL;
>> int err;
>>
>> - err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
>> - if (err)
>> + size = PAGE_ALIGN(size);
>> +
>> + /*
>> + * If we cover the full VMA, we'll perform actual tracking, and
>> + * remember to untrack when the last reference to our tracking
>> + * context from a VMA goes away. We'll keep tracking the whole pfn
>> + * range even during VMA splits and partial unmapping.
>> + *
>> + * If we only cover parts of the VMA, we'll only setup the cachemode
>> + * in the pgprot for the pfn range.
>> + */
>> + if (addr == vma->vm_start && addr + size == vma->vm_end) {
>> + if (vma->pfnmap_track_ctx)
>> + return -EINVAL;
>> + ctx = pfnmap_track_ctx_alloc(pfn, size, &prot);
>> + if (IS_ERR(ctx))
>> + return PTR_ERR(ctx);
>> + } else if (pfnmap_setup_cachemode(pfn, size, &prot)) {
>> return -EINVAL;
>> + }
>>
>> err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
>> - if (err)
>> - untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
>> + if (ctx) {
>> + if (err)
>> + kref_put(&ctx->kref, pfnmap_track_ctx_release);
>> + else
>> + vma->pfnmap_track_ctx = ctx;
>> + }
>> return err;
>> }
>> +
>> +#else
>> +int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>> + unsigned long pfn, unsigned long size, pgprot_t prot)
>> +{
>> + return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
>> +}
>> +#endif
>> EXPORT_SYMBOL(remap_pfn_range);
>>
>> /**
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 50f902c08341a..09c563c951123 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -1784,11 +1784,6 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>> tmp = vm_area_dup(mpnt);
>> if (!tmp)
>> goto fail_nomem;
>> -
>> - /* track_pfn_copy() will later take care of copying internal state. */
>> - if (unlikely(tmp->vm_flags & VM_PFNMAP))
>> - untrack_pfn_clear(tmp);
>> -
>> retval = vma_dup_policy(mpnt, tmp);
>> if (retval)
>> goto fail_nomem_policy;
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 7db9da609c84f..6e78e02f74bd3 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -1191,10 +1191,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
>> if (is_vm_hugetlb_page(vma))
>> clear_vma_resv_huge_pages(vma);
>>
>> - /* Tell pfnmap has moved from this vma */
>> - if (unlikely(vma->vm_flags & VM_PFNMAP))
>> - untrack_pfn_clear(vma);
>> -
>> *new_vma_ptr = new_vma;
>> return err;
>> }
>> diff --git a/mm/vma_init.c b/mm/vma_init.c
>> index 967ca85179864..8e53c7943561e 100644
>> --- a/mm/vma_init.c
>> +++ b/mm/vma_init.c
>> @@ -71,7 +71,51 @@ static void vm_area_init_from(const struct vm_area_struct *src,
>> #ifdef CONFIG_NUMA
>> dest->vm_policy = src->vm_policy;
>> #endif
>> +#ifdef __HAVE_PFNMAP_TRACKING
>> + dest->pfnmap_track_ctx = NULL;
>> +#endif
>> +}
>> +
>> +#ifdef __HAVE_PFNMAP_TRACKING
>> +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
>> + struct vm_area_struct *new)
>> +{
>> + struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx;
>> +
>> + if (likely(!ctx))
>> + return 0;
>> +
>> + /*
>> + * We don't expect to ever hit this. If ever required, we would have
>> + * to duplicate the tracking.
>> + */
>> + if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX))
>
> How not expected is this? :) maybe use WARN_ON_ONCE() if it really should
> never happen?
I guess if we mmap a large PFNMAP and then split it into individual
PTE-sized chunks, we could get many VMAs per-process referencing that
tracking.
Combine that with fork() and I assume one could hit this -- when really
trying hard to achieve it. (probably as a privileged user to get a big
VM_PFNMAP, though I'm not sure)
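A minimal userspace sketch of what I mean (hypothetical device node, no
error handling); every page-sized hole splits a VMA, and each split VMA
takes another reference on the shared tracking context:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical device whose mmap() uses remap_pfn_range(). */
	int fd = open("/dev/pfnmap-test", O_RDWR);
	size_t size = 1024UL * 4096;
	char *p = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
	size_t off;

	/* Punch a hole every other page: each munmap() splits one VMA
	 * in two, and vm_area_dup() takes another kref on the same
	 * pfnmap_track_ctx. */
	for (off = 4096; off < size; off += 2 * 4096)
		munmap(p + off, 4096);

	pause(); /* keep the ~512 remaining VMAs alive */
	return 0;
}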
In that case, a WARN_ON_ONCE() would be bad -- because it could be
triggered by the user.
We could do a pr_warn_once() instead, stating that this is not supported
right now?
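Roughly (untested, just to illustrate the idea):

	/*
	 * Sketch only: warn once instead of WARN_ON_ONCE(), since
	 * userspace can trigger this by splitting a tracked PFNMAP
	 * into very many VMAs.
	 */
	if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX)) {
		pr_warn_once("%s: too many references on pfnmap track ctx, not duplicating\n",
			     __func__);
		return -ENOMEM;
	}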
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 09/11] x86/mm/pat: inline memtype_match() into memtype_erase()
2025-05-12 16:49 ` Lorenzo Stoakes
@ 2025-05-13 9:11 ` David Hildenbrand
0 siblings, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-05-13 9:11 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu
On 12.05.25 18:49, Lorenzo Stoakes wrote:
> On Mon, May 12, 2025 at 02:34:22PM +0200, David Hildenbrand wrote:
>> Let's just have it in a single function. The resulting function is
>> certainly small enough and readable.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> Nice, great bit of refactoring :) the new version is considerably clearer.
Thanks!
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 04/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
2025-05-13 9:10 ` David Hildenbrand
@ 2025-05-13 10:16 ` Lorenzo Stoakes
2025-05-13 10:22 ` David Hildenbrand
0 siblings, 1 reply; 36+ messages in thread
From: Lorenzo Stoakes @ 2025-05-13 10:16 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
On Tue, May 13, 2025 at 11:10:45AM +0200, David Hildenbrand wrote:
> On 12.05.25 18:42, Lorenzo Stoakes wrote:
> > On Mon, May 12, 2025 at 02:34:17PM +0200, David Hildenbrand wrote:
> > > Let's use our new interface. In remap_pfn_range(), we'll now decide
> > > whether we have to track (full VMA covered) or only lookup the
> > > cachemode (partial VMA covered).
> > >
> > > Remember what we have to untrack by linking it from the VMA. When
> > > duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
> > > to anon VMA names, and use a kref to share the tracking.
> > >
> > > Once the last VMA un-refs our tracking data, we'll do the untracking,
> > > which simplifies things a lot and should sort out various issues we saw
> > > recently, for example, when partially unmapping/zapping a tracked VMA.
> > >
> > > This change implies that we'll keep tracking the original PFN range even
> > > after splitting + partially unmapping it: not too bad, because it was
> > > not working reliably before. The only thing that kind-of worked before
> > > was shrinking such a mapping using mremap(): we managed to adjust the
> > > reservation in a hacky way, now we won't adjust the reservation but
> > > leave it around until all involved VMAs are gone.
> > >
> > > If that ever turns out to be an issue, we could hook into VM splitting
> > > code and split the tracking; however, that adds complexity that might
> > > not be required, so we'll keep it simple for now.
> > >
> > > Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> >
> > Other than small nit below,
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > > ---
> > > include/linux/mm_inline.h | 2 +
> > > include/linux/mm_types.h | 11 ++++++
> > > mm/memory.c | 82 +++++++++++++++++++++++++++++++--------
> > > mm/mmap.c | 5 ---
> > > mm/mremap.c | 4 --
> > > mm/vma_init.c | 50 ++++++++++++++++++++++++
> > > 6 files changed, 129 insertions(+), 25 deletions(-)
> > >
> > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > > index f9157a0c42a5c..89b518ff097e6 100644
> > > --- a/include/linux/mm_inline.h
> > > +++ b/include/linux/mm_inline.h
> > > @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
> > >
> > > #endif /* CONFIG_ANON_VMA_NAME */
> > >
> > > +void pfnmap_track_ctx_release(struct kref *ref);
> > > +
> > > static inline void init_tlb_flush_pending(struct mm_struct *mm)
> > > {
> > > atomic_set(&mm->tlb_flush_pending, 0);
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 15808cad2bc1a..3e934dc6057c4 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -763,6 +763,14 @@ struct vma_numab_state {
> > > int prev_scan_seq;
> > > };
> > >
> > > +#ifdef __HAVE_PFNMAP_TRACKING
> > > +struct pfnmap_track_ctx {
> > > + struct kref kref;
> > > + unsigned long pfn;
> > > + unsigned long size; /* in bytes */
> > > +};
> > > +#endif
> > > +
> > > /*
> > > * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
> > > * manipulate mutable fields which will cause those fields to be updated in the
> > > @@ -900,6 +908,9 @@ struct vm_area_struct {
> > > struct anon_vma_name *anon_name;
> > > #endif
> > > struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> > > +#ifdef __HAVE_PFNMAP_TRACKING
> > > + struct pfnmap_track_ctx *pfnmap_track_ctx;
> > > +#endif
> > > } __randomize_layout;
> > >
> > > #ifdef CONFIG_NUMA
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 064fc55d8eab9..4cf4adb0de266 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1371,7 +1371,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > struct mm_struct *dst_mm = dst_vma->vm_mm;
> > > struct mm_struct *src_mm = src_vma->vm_mm;
> > > struct mmu_notifier_range range;
> > > - unsigned long next, pfn = 0;
> > > + unsigned long next;
> > > bool is_cow;
> > > int ret;
> > >
> > > @@ -1381,12 +1381,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > if (is_vm_hugetlb_page(src_vma))
> > > return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
> > >
> > > - if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
> > > - ret = track_pfn_copy(dst_vma, src_vma, &pfn);
> > > - if (ret)
> > > - return ret;
> > > - }
> > > -
> > > /*
> > > * We need to invalidate the secondary MMU mappings only when
> > > * there could be a permission downgrade on the ptes of the
> > > @@ -1428,8 +1422,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > raw_write_seqcount_end(&src_mm->write_protect_seq);
> > > mmu_notifier_invalidate_range_end(&range);
> > > }
> > > - if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
> > > - untrack_pfn_copy(dst_vma, pfn);
> > > return ret;
> > > }
> > >
> > > @@ -1924,9 +1916,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> > > if (vma->vm_file)
> > > uprobe_munmap(vma, start, end);
> > >
> > > - if (unlikely(vma->vm_flags & VM_PFNMAP))
> > > - untrack_pfn(vma, 0, 0, mm_wr_locked);
> > > -
> > > if (start != end) {
> > > if (unlikely(is_vm_hugetlb_page(vma))) {
> > > /*
> > > @@ -2872,6 +2861,36 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
> > > return error;
> > > }
> > >
> > > +#ifdef __HAVE_PFNMAP_TRACKING
> > > +static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
> > > + unsigned long size, pgprot_t *prot)
> > > +{
> > > + struct pfnmap_track_ctx *ctx;
> > > +
> > > + if (pfnmap_track(pfn, size, prot))
> > > + return ERR_PTR(-EINVAL);
> > > +
> > > + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> > > + if (unlikely(!ctx)) {
> > > + pfnmap_untrack(pfn, size);
> > > + return ERR_PTR(-ENOMEM);
> > > + }
> > > +
> > > + ctx->pfn = pfn;
> > > + ctx->size = size;
> > > + kref_init(&ctx->kref);
> > > + return ctx;
> > > +}
> > > +
> > > +void pfnmap_track_ctx_release(struct kref *ref)
> > > +{
> > > + struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref);
> > > +
> > > + pfnmap_untrack(ctx->pfn, ctx->size);
> > > + kfree(ctx);
> > > +}
> > > +#endif /* __HAVE_PFNMAP_TRACKING */
> > > +
> > > /**
> > > * remap_pfn_range - remap kernel memory to userspace
> > > * @vma: user vma to map to
> > > @@ -2884,20 +2903,51 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
> > > *
> > > * Return: %0 on success, negative error code otherwise.
> > > */
> > > +#ifdef __HAVE_PFNMAP_TRACKING
> > > int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> > > unsigned long pfn, unsigned long size, pgprot_t prot)
> > > {
> > > + struct pfnmap_track_ctx *ctx = NULL;
> > > int err;
> > >
> > > - err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
> > > - if (err)
> > > + size = PAGE_ALIGN(size);
> > > +
> > > + /*
> > > + * If we cover the full VMA, we'll perform actual tracking, and
> > > + * remember to untrack when the last reference to our tracking
> > > + * context from a VMA goes away. We'll keep tracking the whole pfn
> > > + * range even during VMA splits and partial unmapping.
> > > + *
> > > + * If we only cover parts of the VMA, we'll only setup the cachemode
> > > + * in the pgprot for the pfn range.
> > > + */
> > > + if (addr == vma->vm_start && addr + size == vma->vm_end) {
> > > + if (vma->pfnmap_track_ctx)
> > > + return -EINVAL;
> > > + ctx = pfnmap_track_ctx_alloc(pfn, size, &prot);
> > > + if (IS_ERR(ctx))
> > > + return PTR_ERR(ctx);
> > > + } else if (pfnmap_setup_cachemode(pfn, size, &prot)) {
> > > return -EINVAL;
> > > + }
> > >
> > > err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> > > - if (err)
> > > - untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
> > > + if (ctx) {
> > > + if (err)
> > > + kref_put(&ctx->kref, pfnmap_track_ctx_release);
> > > + else
> > > + vma->pfnmap_track_ctx = ctx;
> > > + }
> > > return err;
> > > }
> > > +
> > > +#else
> > > +int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> > > + unsigned long pfn, unsigned long size, pgprot_t prot)
> > > +{
> > > + return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> > > +}
> > > +#endif
> > > EXPORT_SYMBOL(remap_pfn_range);
> > >
> > > /**
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 50f902c08341a..09c563c951123 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1784,11 +1784,6 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> > > tmp = vm_area_dup(mpnt);
> > > if (!tmp)
> > > goto fail_nomem;
> > > -
> > > - /* track_pfn_copy() will later take care of copying internal state. */
> > > - if (unlikely(tmp->vm_flags & VM_PFNMAP))
> > > - untrack_pfn_clear(tmp);
> > > -
> > > retval = vma_dup_policy(mpnt, tmp);
> > > if (retval)
> > > goto fail_nomem_policy;
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 7db9da609c84f..6e78e02f74bd3 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -1191,10 +1191,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
> > > if (is_vm_hugetlb_page(vma))
> > > clear_vma_resv_huge_pages(vma);
> > >
> > > - /* Tell pfnmap has moved from this vma */
> > > - if (unlikely(vma->vm_flags & VM_PFNMAP))
> > > - untrack_pfn_clear(vma);
> > > -
> > > *new_vma_ptr = new_vma;
> > > return err;
> > > }
> > > diff --git a/mm/vma_init.c b/mm/vma_init.c
> > > index 967ca85179864..8e53c7943561e 100644
> > > --- a/mm/vma_init.c
> > > +++ b/mm/vma_init.c
> > > @@ -71,7 +71,51 @@ static void vm_area_init_from(const struct vm_area_struct *src,
> > > #ifdef CONFIG_NUMA
> > > dest->vm_policy = src->vm_policy;
> > > #endif
> > > +#ifdef __HAVE_PFNMAP_TRACKING
> > > + dest->pfnmap_track_ctx = NULL;
> > > +#endif
> > > +}
> > > +
> > > +#ifdef __HAVE_PFNMAP_TRACKING
> > > +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> > > + struct vm_area_struct *new)
> > > +{
> > > + struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx;
> > > +
> > > + if (likely(!ctx))
> > > + return 0;
> > > +
> > > + /*
> > > + * We don't expect to ever hit this. If ever required, we would have
> > > + * to duplicate the tracking.
> > > + */
> > > + if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX))
> >
> > How not expected is this? :) maybe use WARN_ON_ONCE() if it really should
> > never happen?
> I guess if we mmap a large PFNMAP and then split it into individual
> PTE-sized chunks, we could get many VMAs per-process referencing that
> tracking.
>
> Combine that with fork() and I assume one could hit this -- when really
> trying hard to achieve it. (probably as a privileged user to get a big
> VM_PFNMAP, though I'm not sure)
Right ok, yeah I guess so. It'd be good to see if we could trigger it somehow :)
>
> In that case, a WARN_ON_ONCE() would be bad -- because it could be triggered
> by the user.
Ack
>
> We could do a pr_warn_once() instead, stating that this is not supported
> right now?
Hmm, if we truly think it might happen, let's avoid printing anything for now.
Maybe just ++todo for experimenting with triggering?
It's not hugely important!
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 04/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
2025-05-13 10:16 ` Lorenzo Stoakes
@ 2025-05-13 10:22 ` David Hildenbrand
0 siblings, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-05-13 10:22 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Liam R. Howlett,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
On 13.05.25 12:16, Lorenzo Stoakes wrote:
> On Tue, May 13, 2025 at 11:10:45AM +0200, David Hildenbrand wrote:
>> On 12.05.25 18:42, Lorenzo Stoakes wrote:
>>> On Mon, May 12, 2025 at 02:34:17PM +0200, David Hildenbrand wrote:
>>>> Let's use our new interface. In remap_pfn_range(), we'll now decide
>>>> whether we have to track (full VMA covered) or only lookup the
>>>> cachemode (partial VMA covered).
>>>>
>>>> Remember what we have to untrack by linking it from the VMA. When
>>>> duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
>>>> to anon VMA names, and use a kref to share the tracking.
>>>>
>>>> Once the last VMA un-refs our tracking data, we'll do the untracking,
>>>> which simplifies things a lot and should sort out various issues we saw
>>>> recently, for example, when partially unmapping/zapping a tracked VMA.
>>>>
>>>> This change implies that we'll keep tracking the original PFN range even
>>>> after splitting + partially unmapping it: not too bad, because it was
>>>> not working reliably before. The only thing that kind-of worked before
>>>> was shrinking such a mapping using mremap(): we managed to adjust the
>>>> reservation in a hacky way, now we won't adjust the reservation but
>>>> leave it around until all involved VMAs are gone.
>>>>
>>>> If that ever turns out to be an issue, we could hook into VM splitting
>>>> code and split the tracking; however, that adds complexity that might
>>>> not be required, so we'll keep it simple for now.
>>>>
>>>> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>> Other than small nit below,
>>>
>>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>
>>>> ---
>>>> include/linux/mm_inline.h | 2 +
>>>> include/linux/mm_types.h | 11 ++++++
>>>> mm/memory.c | 82 +++++++++++++++++++++++++++++++--------
>>>> mm/mmap.c | 5 ---
>>>> mm/mremap.c | 4 --
>>>> mm/vma_init.c | 50 ++++++++++++++++++++++++
>>>> 6 files changed, 129 insertions(+), 25 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>>>> index f9157a0c42a5c..89b518ff097e6 100644
>>>> --- a/include/linux/mm_inline.h
>>>> +++ b/include/linux/mm_inline.h
>>>> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>>>>
>>>> #endif /* CONFIG_ANON_VMA_NAME */
>>>>
>>>> +void pfnmap_track_ctx_release(struct kref *ref);
>>>> +
>>>> static inline void init_tlb_flush_pending(struct mm_struct *mm)
>>>> {
>>>> atomic_set(&mm->tlb_flush_pending, 0);
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 15808cad2bc1a..3e934dc6057c4 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -763,6 +763,14 @@ struct vma_numab_state {
>>>> int prev_scan_seq;
>>>> };
>>>>
>>>> +#ifdef __HAVE_PFNMAP_TRACKING
>>>> +struct pfnmap_track_ctx {
>>>> + struct kref kref;
>>>> + unsigned long pfn;
>>>> + unsigned long size; /* in bytes */
>>>> +};
>>>> +#endif
>>>> +
>>>> /*
>>>> * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
>>>> * manipulate mutable fields which will cause those fields to be updated in the
>>>> @@ -900,6 +908,9 @@ struct vm_area_struct {
>>>> struct anon_vma_name *anon_name;
>>>> #endif
>>>> struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>>>> +#ifdef __HAVE_PFNMAP_TRACKING
>>>> + struct pfnmap_track_ctx *pfnmap_track_ctx;
>>>> +#endif
>>>> } __randomize_layout;
>>>>
>>>> #ifdef CONFIG_NUMA
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 064fc55d8eab9..4cf4adb0de266 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1371,7 +1371,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>>> struct mm_struct *dst_mm = dst_vma->vm_mm;
>>>> struct mm_struct *src_mm = src_vma->vm_mm;
>>>> struct mmu_notifier_range range;
>>>> - unsigned long next, pfn = 0;
>>>> + unsigned long next;
>>>> bool is_cow;
>>>> int ret;
>>>>
>>>> @@ -1381,12 +1381,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>>> if (is_vm_hugetlb_page(src_vma))
>>>> return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
>>>>
>>>> - if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
>>>> - ret = track_pfn_copy(dst_vma, src_vma, &pfn);
>>>> - if (ret)
>>>> - return ret;
>>>> - }
>>>> -
>>>> /*
>>>> * We need to invalidate the secondary MMU mappings only when
>>>> * there could be a permission downgrade on the ptes of the
>>>> @@ -1428,8 +1422,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>>>> raw_write_seqcount_end(&src_mm->write_protect_seq);
>>>> mmu_notifier_invalidate_range_end(&range);
>>>> }
>>>> - if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
>>>> - untrack_pfn_copy(dst_vma, pfn);
>>>> return ret;
>>>> }
>>>>
>>>> @@ -1924,9 +1916,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>>>> if (vma->vm_file)
>>>> uprobe_munmap(vma, start, end);
>>>>
>>>> - if (unlikely(vma->vm_flags & VM_PFNMAP))
>>>> - untrack_pfn(vma, 0, 0, mm_wr_locked);
>>>> -
>>>> if (start != end) {
>>>> if (unlikely(is_vm_hugetlb_page(vma))) {
>>>> /*
>>>> @@ -2872,6 +2861,36 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
>>>> return error;
>>>> }
>>>>
>>>> +#ifdef __HAVE_PFNMAP_TRACKING
>>>> +static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
>>>> + unsigned long size, pgprot_t *prot)
>>>> +{
>>>> + struct pfnmap_track_ctx *ctx;
>>>> +
>>>> + if (pfnmap_track(pfn, size, prot))
>>>> + return ERR_PTR(-EINVAL);
>>>> +
>>>> + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
>>>> + if (unlikely(!ctx)) {
>>>> + pfnmap_untrack(pfn, size);
>>>> + return ERR_PTR(-ENOMEM);
>>>> + }
>>>> +
>>>> + ctx->pfn = pfn;
>>>> + ctx->size = size;
>>>> + kref_init(&ctx->kref);
>>>> + return ctx;
>>>> +}
>>>> +
>>>> +void pfnmap_track_ctx_release(struct kref *ref)
>>>> +{
>>>> + struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref);
>>>> +
>>>> + pfnmap_untrack(ctx->pfn, ctx->size);
>>>> + kfree(ctx);
>>>> +}
>>>> +#endif /* __HAVE_PFNMAP_TRACKING */
>>>> +
>>>> /**
>>>> * remap_pfn_range - remap kernel memory to userspace
>>>> * @vma: user vma to map to
>>>> @@ -2884,20 +2903,51 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
>>>> *
>>>> * Return: %0 on success, negative error code otherwise.
>>>> */
>>>> +#ifdef __HAVE_PFNMAP_TRACKING
>>>> int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>>>> unsigned long pfn, unsigned long size, pgprot_t prot)
>>>> {
>>>> + struct pfnmap_track_ctx *ctx = NULL;
>>>> int err;
>>>>
>>>> - err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
>>>> - if (err)
>>>> + size = PAGE_ALIGN(size);
>>>> +
>>>> + /*
>>>> + * If we cover the full VMA, we'll perform actual tracking, and
>>>> + * remember to untrack when the last reference to our tracking
>>>> + * context from a VMA goes away. We'll keep tracking the whole pfn
>>>> + * range even during VMA splits and partial unmapping.
>>>> + *
>>>> + * If we only cover parts of the VMA, we'll only setup the cachemode
>>>> + * in the pgprot for the pfn range.
>>>> + */
>>>> + if (addr == vma->vm_start && addr + size == vma->vm_end) {
>>>> + if (vma->pfnmap_track_ctx)
>>>> + return -EINVAL;
>>>> + ctx = pfnmap_track_ctx_alloc(pfn, size, &prot);
>>>> + if (IS_ERR(ctx))
>>>> + return PTR_ERR(ctx);
>>>> + } else if (pfnmap_setup_cachemode(pfn, size, &prot)) {
>>>> return -EINVAL;
>>>> + }
>>>>
>>>> err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
>>>> - if (err)
>>>> - untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
>>>> + if (ctx) {
>>>> + if (err)
>>>> + kref_put(&ctx->kref, pfnmap_track_ctx_release);
>>>> + else
>>>> + vma->pfnmap_track_ctx = ctx;
>>>> + }
>>>> return err;
>>>> }
>>>> +
>>>> +#else
>>>> +int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
>>>> + unsigned long pfn, unsigned long size, pgprot_t prot)
>>>> +{
>>>> + return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
>>>> +}
>>>> +#endif
>>>> EXPORT_SYMBOL(remap_pfn_range);
>>>>
>>>> /**
>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>> index 50f902c08341a..09c563c951123 100644
>>>> --- a/mm/mmap.c
>>>> +++ b/mm/mmap.c
>>>> @@ -1784,11 +1784,6 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>>>> tmp = vm_area_dup(mpnt);
>>>> if (!tmp)
>>>> goto fail_nomem;
>>>> -
>>>> - /* track_pfn_copy() will later take care of copying internal state. */
>>>> - if (unlikely(tmp->vm_flags & VM_PFNMAP))
>>>> - untrack_pfn_clear(tmp);
>>>> -
>>>> retval = vma_dup_policy(mpnt, tmp);
>>>> if (retval)
>>>> goto fail_nomem_policy;
>>>> diff --git a/mm/mremap.c b/mm/mremap.c
>>>> index 7db9da609c84f..6e78e02f74bd3 100644
>>>> --- a/mm/mremap.c
>>>> +++ b/mm/mremap.c
>>>> @@ -1191,10 +1191,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
>>>> if (is_vm_hugetlb_page(vma))
>>>> clear_vma_resv_huge_pages(vma);
>>>>
>>>> - /* Tell pfnmap has moved from this vma */
>>>> - if (unlikely(vma->vm_flags & VM_PFNMAP))
>>>> - untrack_pfn_clear(vma);
>>>> -
>>>> *new_vma_ptr = new_vma;
>>>> return err;
>>>> }
>>>> diff --git a/mm/vma_init.c b/mm/vma_init.c
>>>> index 967ca85179864..8e53c7943561e 100644
>>>> --- a/mm/vma_init.c
>>>> +++ b/mm/vma_init.c
>>>> @@ -71,7 +71,51 @@ static void vm_area_init_from(const struct vm_area_struct *src,
>>>> #ifdef CONFIG_NUMA
>>>> dest->vm_policy = src->vm_policy;
>>>> #endif
>>>> +#ifdef __HAVE_PFNMAP_TRACKING
>>>> + dest->pfnmap_track_ctx = NULL;
>>>> +#endif
>>>> +}
>>>> +
>>>> +#ifdef __HAVE_PFNMAP_TRACKING
>>>> +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
>>>> + struct vm_area_struct *new)
>>>> +{
>>>> + struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx;
>>>> +
>>>> + if (likely(!ctx))
>>>> + return 0;
>>>> +
>>>> + /*
>>>> + * We don't expect to ever hit this. If ever required, we would have
>>>> + * to duplicate the tracking.
>>>> + */
>>>> + if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX))
>>>
>>> How not expected is this? :) maybe use WARN_ON_ONCE() if it really should
>>> never happen?
>> I guess if we mmap a large PFNMAP and then split it into individual
>> PTE-sized chunks, we could get many VMAs per-process referencing that
>> tracking.
>>
>> Combine that with fork() and I assume one could hit this -- when really
>> trying hard to achieve it. (probably as a privileged user to get a big
>> VM_PFNMAP, though -- not sure)
>
> Right ok, yeah I guess so. It'd be good to see if we could trigger it somehow :)
>
>>
>> In that case, a WARN_ON_ONCE() would be bad -- because it could be triggered
>> by the user.
>
> Ack
>
>>
>> We could do a pr_warn_once() instead, stating that this is not supported
>> right now?
>
> Hmm, if we truly think it might happen, let's avoid printing anything for now.
>
> Maybe just ++todo for experimenting with triggering?
>
> It's not hugely important!
Agreed. I assume it's similar to our mapcount vs. refcount overflows: if
you really want to trigger it, there might be some weird way ... but it's
no longer in the "valid use case" area :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT
2025-05-12 12:34 [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT David Hildenbrand
` (10 preceding siblings ...)
2025-05-12 12:34 ` [PATCH v2 11/11] mm/io-mapping: " David Hildenbrand
@ 2025-05-13 15:53 ` Liam R. Howlett
2025-05-13 17:17 ` David Hildenbrand
11 siblings, 1 reply; 36+ messages in thread
From: Liam R. Howlett @ 2025-05-13 15:53 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu
* David Hildenbrand <david@redhat.com> [250512 08:34]:
> On top of mm-unstable.
>
> VM_PAT annoyed me too much and wasted too much of my time, let's clean
> PAT handling up and remove VM_PAT.
>
> This should sort out various issues with VM_PAT we discovered recently,
> and will hopefully make the whole code more stable and easier to maintain.
>
> In essence: we stop letting PAT mode mess with VMAs and instead lift
> what to track/untrack to the MM core. We remember per VMA which pfn range
> we tracked in a new struct we attach to a VMA (we have space without
> exceeding 192 bytes), use a kref to share it among VMAs during
> split/mremap/fork, and automatically untrack once the kref drops to 0.
What you do here seems to be to decouple the tracked pfn range from the
vma start/end addresses by abstracting it into a separately allocated,
refcounted struct. This is close to what we do with the anon vma name...
It took a while to understand the underlying interval tree tracking of
this change, but I think it's as good as it was. IIRC, there was a
shrinking and matching to the end address in the interval tree, but I
failed to find that commit and code - maybe it never made it upstream.
I was able to find a thread about splitting [1], so maybe I'm mistaken.
>
> This implies that we'll keep tracking a full pfn range even after partially
> unmapping it, until fully unmapping it; but as that case was mostly broken
> before, this at least makes it work in a way that is least intrusive to
> VMA handling.
>
> Shrinking with mremap() used to work in a hacky way, now we'll similarly
> keep the original pfn range tracked even after this form of partial unmap.
> Does anybody care about that? Unlikely. If we run into issues, we could
> likely handle that (adjust the tracking) when our kref drops to 1 while
> freeing a VMA. But it adds more complexity, so avoid that for now.
The decoupling of the vma and the refcounted range means that we could
beef up the backend to support actually tracking the correct range, which
would be nice... but I have very little desire to work on that.
[1] https://lore.kernel.org/all/5jrd43vusvcchpk2x6mouighkfhamjpaya5fu2cvikzaieg5pq@wqccwmjs4ian/
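To be concrete, I'm imagining a (completely made-up) helper that the last
holder could use to fix up the reservation -- only a sketch of the idea,
nothing like this exists in the series, and it ignores the window between
untracking and re-tracking:

  /* sketch: adjust the tracked range; only sane while we hold the
   * last reference on the ctx */
  static int pfnmap_track_ctx_adjust(struct pfnmap_track_ctx *ctx,
                                     unsigned long new_pfn,
                                     unsigned long new_size,
                                     pgprot_t *prot)
  {
          if (kref_read(&ctx->kref) != 1)
                  return -EBUSY;

          pfnmap_untrack(ctx->pfn, ctx->size);
          if (pfnmap_track(new_pfn, new_size, prot))
                  return -EINVAL;
          ctx->pfn = new_pfn;
          ctx->size = new_size;
          return 0;
  }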
>
> Briefly tested with the new pfnmap selftests [1].
>
> [1] https://lkml.kernel.org/r/20250509153033.952746-1-david@redhat.com
Oh yes, that's still a pr_info() log. I think that should be a pr_err()
at least?
>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Tvrtko Ursulin <tursulin@ursulin.net>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Cc: Peter Xu <peterx@redhat.com>
>
> v1 -> v2:
> * "mm: convert track_pfn_insert() to pfnmap_setup_cachemode*()"
> -> Call it "pfnmap_setup_cachemode()" and improve the documentation
> -> Add pfnmap_setup_cachemode_pfn()
> -> Keep checking a single PFN for PMD/PUD case and document why it's ok
> * Merged memremap conversion patch with pfnmap_track() introduction patch
> -> Improve documentation
> * "mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()"
> -> Adjust to code changes in mm-unstable
> * Added "x86/mm/pat: inline memtype_match() into memtype_erase()"
> * "mm/io-mapping: track_pfn() -> "pfnmap tracking""
> -> Adjust to code changes in mm-unstable
>
> David Hildenbrand (11):
> x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode()
> mm: convert track_pfn_insert() to pfnmap_setup_cachemode*()
> mm: introduce pfnmap_track() and pfnmap_untrack() and use them for
> memremap
> mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
> x86/mm/pat: remove old pfnmap tracking interface
> mm: remove VM_PAT
> x86/mm/pat: remove strict_prot parameter from reserve_pfn_range()
> x86/mm/pat: remove MEMTYPE_*_MATCH
> x86/mm/pat: inline memtype_match() into memtype_erase()
> drm/i915: track_pfn() -> "pfnmap tracking"
> mm/io-mapping: track_pfn() -> "pfnmap tracking"
>
> arch/x86/mm/pat/memtype.c | 194 ++++-------------------------
> arch/x86/mm/pat/memtype_interval.c | 63 ++--------
> drivers/gpu/drm/i915/i915_mm.c | 4 +-
> include/linux/mm.h | 4 +-
> include/linux/mm_inline.h | 2 +
> include/linux/mm_types.h | 11 ++
> include/linux/pgtable.h | 127 ++++++++++---------
> include/trace/events/mmflags.h | 4 +-
> mm/huge_memory.c | 5 +-
> mm/io-mapping.c | 2 +-
> mm/memory.c | 86 ++++++++++---
> mm/memremap.c | 8 +-
> mm/mmap.c | 5 -
> mm/mremap.c | 4 -
> mm/vma_init.c | 50 ++++++++
> 15 files changed, 242 insertions(+), 327 deletions(-)
>
>
> base-commit: c68cfbc5048ede4b10a1d3fe16f7f6192fc2c9c8
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT
2025-05-13 15:53 ` [PATCH v2 00/11] mm: rewrite pfnmap tracking and remove VM_PAT Liam R. Howlett
@ 2025-05-13 17:17 ` David Hildenbrand
0 siblings, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-05-13 17:17 UTC (permalink / raw)
To: Liam R. Howlett, linux-kernel, linux-mm, x86, intel-gfx,
dri-devel, linux-trace-kernel, Dave Hansen, Andy Lutomirski,
Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
H. Peter Anvin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
Tvrtko Ursulin, David Airlie, Simona Vetter, Andrew Morton,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu
On 13.05.25 17:53, Liam R. Howlett wrote:
> * David Hildenbrand <david@redhat.com> [250512 08:34]:
>> On top of mm-unstable.
>>
>> VM_PAT annoyed me too much and wasted too much of my time, let's clean
>> PAT handling up and remove VM_PAT.
>>
>> This should sort out various issues with VM_PAT we discovered recently,
>> and will hopefully make the whole code more stable and easier to maintain.
>>
>> In essence: we stop letting PAT mode mess with VMAs and instead lift
>> what to track/untrack to the MM core. We remember per VMA which pfn range
>> we tracked in a new struct we attach to a VMA (we have space without
>> exceeding 192 bytes), use a kref to share it among VMAs during
>> split/mremap/fork, and automatically untrack once the kref drops to 0.
>
> What you do here seems to be to decouple the tracked pfn range from the
> vma start/end addresses by abstracting it into a separately allocated,
> refcounted struct. This is close to what we do with the anon vma name...
Yes, inspired by that.
>
> It took a while to understand the underlying interval tree tracking of
> this change, but I think it's as good as it was. IIRC, there was a
> shrinking and matching to the end address in the interval tree, but I
> failed to find that commit and code - maybe it never made it upstream.
> I was able to find a thread about splitting [1], so maybe I'm mistaken.
There was hidden code that kept shrinking with mremap() working
(adjusting the tracked range).
The leftovers are removed in patch #8.
See below.
>
>>
>> This implies that we'll keep tracking a full pfn range even after partially
>> unmapping it, until fully unmapping it; but as that case was mostly broken
>> before, this at least makes it work in a way that is least intrusive to
>> VMA handling.
>>
>> Shrinking with mremap() used to work in a hacky way, now we'll similarly
>> keep the original pfn range tracked even after this form of partial unmap.
>> Does anybody care about that? Unlikely. If we run into issues, we could
>> likely handle that (adjust the tracking) when our kref drops to 1 while
>> freeing a VMA. But it adds more complexity, so avoid that for now.
>
> The decoupling of the vma and the refcounted range means that we could
> beef up the backend to support actually tracking the correct range, which
> would be nice...
Right, in patch #4 I have
"
This change implies that we'll keep tracking the original PFN range even
after splitting + partially unmapping it: not too bad, because it was
not working reliably before. The only thing that kind-of worked before
was shrinking such a mapping using mremap(): we managed to adjust the
reservation in a hacky way, now we won't adjust the reservation but
leave it around until all involved VMAs are gone.
If that ever turns out to be an issue, we could hook into VM splitting
code and split the tracking; however, that adds complexity that might
not be required, so we'll keep it simple for now.
"
Duplicating/moving/forking VMAs is now definitely better than before.
Splitting is also arguably better than before -- even a simple partial
munmap() [1] is currently problematic, unless we're munmapping the last
part of a VMA (-> shrinking).
Implementing splitting properly is a bit complicated if the pfnmap ctx
has more than one ref, but it could be added if ever really required.
[1] https://lkml.kernel.org/r/20250509153033.952746-1-david@redhat.com
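If it ever is required, I'd imagine splitting one ctx into two, roughly
like this (pure sketch, the function is made up; it would also require
the backend to support freeing sub-ranges again -- patch #8 removes the
leftovers of exactly that):

  /* hypothetical: split one tracked range into two at 'offset'
   * bytes; each resulting ctx untracks its part on release */
  static struct pfnmap_track_ctx *
  pfnmap_track_ctx_split(struct pfnmap_track_ctx *ctx,
                         unsigned long offset)
  {
          struct pfnmap_track_ctx *tail;

          /* with more than one ref, all holders would have to
           * agree on the split -- that's the complicated part */
          if (kref_read(&ctx->kref) != 1)
                  return ERR_PTR(-EBUSY);

          tail = kmalloc(sizeof(*tail), GFP_KERNEL);
          if (!tail)
                  return ERR_PTR(-ENOMEM);

          tail->pfn = ctx->pfn + PHYS_PFN(offset);
          tail->size = ctx->size - offset;
          kref_init(&tail->kref);
          ctx->size = offset;
          return tail;
  }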
> but I have very little desire to work on that.
Yep :)
>
>
> [1] https://lore.kernel.org/all/5jrd43vusvcchpk2x6mouighkfhamjpaya5fu2cvikzaieg5pq@wqccwmjs4ian/
>
>>
>> Briefly tested with the new pfnmap selftests [1].
>>
>> [1] https://lkml.kernel.org/r/20250509153033.952746-1-david@redhat.com
>
> oh yes, that's still a pr_info() log. I think that should be a pr_err()
> at least?
I was wondering if that should actually be a WARN_ON_ONCE(). Now, it
should be much harder to actually trigger.
Thanks!
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode()
2025-05-12 12:34 ` [PATCH v2 01/11] x86/mm/pat: factor out setting cachemode into pgprot_set_cachemode() David Hildenbrand
@ 2025-05-13 17:29 ` Liam R. Howlett
0 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2025-05-13 17:29 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
* David Hildenbrand <david@redhat.com> [250512 08:34]:
> Let's factor it out to make the code easier to grasp. Drop one comment
> where it is now rather obvious what is happening.
>
> Use it also in pgprot_writecombine()/pgprot_writethrough() where
> clearing the old cachemode might not be required, but given that we are
> already doing a function call, no need to care about this
> micro-optimization.
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
> Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> arch/x86/mm/pat/memtype.c | 33 +++++++++++++++------------------
> 1 file changed, 15 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index 72d8cbc611583..edec5859651d6 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -800,6 +800,12 @@ static inline int range_is_allowed(unsigned long pfn, unsigned long size)
> }
> #endif /* CONFIG_STRICT_DEVMEM */
>
> +static inline void pgprot_set_cachemode(pgprot_t *prot, enum page_cache_mode pcm)
> +{
> + *prot = __pgprot((pgprot_val(*prot) & ~_PAGE_CACHE_MASK) |
> + cachemode2protval(pcm));
> +}
> +
> int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
> unsigned long size, pgprot_t *vma_prot)
> {
> @@ -811,8 +817,7 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
> if (file->f_flags & O_DSYNC)
> pcm = _PAGE_CACHE_MODE_UC_MINUS;
>
> - *vma_prot = __pgprot((pgprot_val(*vma_prot) & ~_PAGE_CACHE_MASK) |
> - cachemode2protval(pcm));
> + pgprot_set_cachemode(vma_prot, pcm);
> return 1;
> }
>
> @@ -880,9 +885,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
> (unsigned long long)paddr,
> (unsigned long long)(paddr + size - 1),
> cattr_name(pcm));
> - *vma_prot = __pgprot((pgprot_val(*vma_prot) &
> - (~_PAGE_CACHE_MASK)) |
> - cachemode2protval(pcm));
> + pgprot_set_cachemode(vma_prot, pcm);
> }
> return 0;
> }
> @@ -907,9 +910,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
> * We allow returning different type than the one requested in
> * non strict case.
> */
> - *vma_prot = __pgprot((pgprot_val(*vma_prot) &
> - (~_PAGE_CACHE_MASK)) |
> - cachemode2protval(pcm));
> + pgprot_set_cachemode(vma_prot, pcm);
> }
>
> if (memtype_kernel_map_sync(paddr, size, pcm) < 0) {
> @@ -1060,9 +1061,7 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> return -EINVAL;
> }
>
> - *prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
> - cachemode2protval(pcm));
> -
> + pgprot_set_cachemode(prot, pcm);
> return 0;
> }
>
> @@ -1073,10 +1072,8 @@ void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
> if (!pat_enabled())
> return;
>
> - /* Set prot based on lookup */
> pcm = lookup_memtype(pfn_t_to_phys(pfn));
> - *prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
> - cachemode2protval(pcm));
> + pgprot_set_cachemode(prot, pcm);
> }
>
> /*
> @@ -1115,15 +1112,15 @@ void untrack_pfn_clear(struct vm_area_struct *vma)
>
> pgprot_t pgprot_writecombine(pgprot_t prot)
> {
> - return __pgprot(pgprot_val(prot) |
> - cachemode2protval(_PAGE_CACHE_MODE_WC));
> + pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WC);
> + return prot;
> }
> EXPORT_SYMBOL_GPL(pgprot_writecombine);
>
> pgprot_t pgprot_writethrough(pgprot_t prot)
> {
> - return __pgprot(pgprot_val(prot) |
> - cachemode2protval(_PAGE_CACHE_MODE_WT));
> + pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WT);
> + return prot;
> }
> EXPORT_SYMBOL_GPL(pgprot_writethrough);
>
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 02/11] mm: convert track_pfn_insert() to pfnmap_setup_cachemode*()
2025-05-12 12:34 ` [PATCH v2 02/11] mm: convert track_pfn_insert() to pfnmap_setup_cachemode*() David Hildenbrand
2025-05-12 15:43 ` Lorenzo Stoakes
@ 2025-05-13 17:29 ` Liam R. Howlett
1 sibling, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2025-05-13 17:29 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
* David Hildenbrand <david@redhat.com> [250512 08:34]:
> ... by factoring it out from track_pfn_remap() into
> pfnmap_setup_cachemode() and provide pfnmap_setup_cachemode_pfn() as
> a replacement for track_pfn_insert().
>
> For PMDs/PUDs, we keep checking a single pfn only. Add some documentation,
> and also document why it is valid to not check the whole pfn range.
>
> We'll reuse pfnmap_setup_cachemode() from core MM next.
>
> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
> Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> arch/x86/mm/pat/memtype.c | 24 ++++++------------
> include/linux/pgtable.h | 52 +++++++++++++++++++++++++++++++++------
> mm/huge_memory.c | 5 ++--
> mm/memory.c | 4 +--
> 4 files changed, 57 insertions(+), 28 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index edec5859651d6..fa78facc6f633 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -1031,7 +1031,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> unsigned long pfn, unsigned long addr, unsigned long size)
> {
> resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> - enum page_cache_mode pcm;
>
> /* reserve the whole chunk starting from paddr */
> if (!vma || (addr == vma->vm_start
> @@ -1044,13 +1043,17 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> return ret;
> }
>
> + return pfnmap_setup_cachemode(pfn, size, prot);
> +}
> +
> +int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, pgprot_t *prot)
> +{
> + resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> + enum page_cache_mode pcm;
> +
> if (!pat_enabled())
> return 0;
>
> - /*
> - * For anything smaller than the vma size we set prot based on the
> - * lookup.
> - */
> pcm = lookup_memtype(paddr);
>
> /* Check memtype for the remaining pages */
> @@ -1065,17 +1068,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> return 0;
> }
>
> -void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
> -{
> - enum page_cache_mode pcm;
> -
> - if (!pat_enabled())
> - return;
> -
> - pcm = lookup_memtype(pfn_t_to_phys(pfn));
> - pgprot_set_cachemode(prot, pcm);
> -}
> -
> /*
> * untrack_pfn is called while unmapping a pfnmap for a region.
> * untrack can be called for a specific region indicated by pfn and size or
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index f1e890b604609..be1745839871c 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1496,13 +1496,10 @@ static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> return 0;
> }
>
> -/*
> - * track_pfn_insert is called when a _new_ single pfn is established
> - * by vmf_insert_pfn().
> - */
> -static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> - pfn_t pfn)
> +static inline int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size,
> + pgprot_t *prot)
> {
> + return 0;
> }
>
> /*
> @@ -1552,8 +1549,32 @@ static inline void untrack_pfn_clear(struct vm_area_struct *vma)
> extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> unsigned long pfn, unsigned long addr,
> unsigned long size);
> -extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> - pfn_t pfn);
> +
> +/**
> + * pfnmap_setup_cachemode - setup the cachemode in the pgprot for a pfn range
> + * @pfn: the start of the pfn range
> + * @size: the size of the pfn range in bytes
> + * @prot: the pgprot to modify
> + *
> + * Lookup the cachemode for the pfn range starting at @pfn with the size
> + * @size and store it in @prot, leaving other data in @prot unchanged.
> + *
> + * This allows for a hardware implementation to have fine-grained control of
> + * memory cache behavior at page level granularity. Without a hardware
> + * implementation, this function does nothing.
> + *
> + * Currently there is only one implementation for this - x86 Page Attribute
> + * Table (PAT). See Documentation/arch/x86/pat.rst for more details.
> + *
> + * This function can fail if the pfn range spans pfns that require differing
> + * cachemodes. If the pfn range was previously verified to have a single
> + * cachemode, it is sufficient to query only a single pfn. The assumption is
> + * that this is the case for drivers using the vmf_insert_pfn*() interface.
> + *
> + * Returns 0 on success and -EINVAL on error.
> + */
> +int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size,
> + pgprot_t *prot);
> extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> struct vm_area_struct *src_vma, unsigned long *pfn);
> extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> @@ -1563,6 +1584,21 @@ extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> extern void untrack_pfn_clear(struct vm_area_struct *vma);
> #endif
>
> +/**
> + * pfnmap_setup_cachemode_pfn - setup the cachemode in the pgprot for a pfn
> + * @pfn: the pfn
> + * @prot: the pgprot to modify
> + *
> + * Lookup the cachemode for @pfn and store it in @prot, leaving other
> + * data in @prot unchanged.
> + *
> + * See pfnmap_setup_cachemode() for details.
> + */
> +static inline void pfnmap_setup_cachemode_pfn(unsigned long pfn, pgprot_t *prot)
> +{
> + pfnmap_setup_cachemode(pfn, PAGE_SIZE, prot);
> +}
> +
> #ifdef CONFIG_MMU
> #ifdef __HAVE_COLOR_ZERO_PAGE
> static inline int is_zero_pfn(unsigned long pfn)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2780a12b25f01..d3e66136e41a3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1455,7 +1455,8 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> return VM_FAULT_OOM;
> }
>
> - track_pfn_insert(vma, &pgprot, pfn);
> + pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
> +
> ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write,
> pgtable);
> @@ -1577,7 +1578,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
> if (addr < vma->vm_start || addr >= vma->vm_end)
> return VM_FAULT_SIGBUS;
>
> - track_pfn_insert(vma, &pgprot, pfn);
> + pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
>
> ptl = pud_lock(vma->vm_mm, vmf->pud);
> insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
> diff --git a/mm/memory.c b/mm/memory.c
> index 99af83434e7c5..064fc55d8eab9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2564,7 +2564,7 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
> if (!pfn_modify_allowed(pfn, pgprot))
> return VM_FAULT_SIGBUS;
>
> - track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
> + pfnmap_setup_cachemode_pfn(pfn, &pgprot);
>
> return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
> false);
> @@ -2627,7 +2627,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
> if (addr < vma->vm_start || addr >= vma->vm_end)
> return VM_FAULT_SIGBUS;
>
> - track_pfn_insert(vma, &pgprot, pfn);
> + pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
>
> if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
> return VM_FAULT_SIGBUS;
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 03/11] mm: introduce pfnmap_track() and pfnmap_untrack() and use them for memremap
2025-05-12 12:34 ` [PATCH v2 03/11] mm: introduce pfnmap_track() and pfnmap_untrack() and use them for memremap David Hildenbrand
@ 2025-05-13 17:40 ` Liam R. Howlett
2025-05-14 17:57 ` David Hildenbrand
0 siblings, 1 reply; 36+ messages in thread
From: Liam R. Howlett @ 2025-05-13 17:40 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
* David Hildenbrand <david@redhat.com> [250512 08:34]:
> Let's provide variants of track_pfn_remap() and untrack_pfn() that won't
> mess with VMAs, and replace the usage in mm/memremap.c.
>
> Add some documentation.
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
> Signed-off-by: David Hildenbrand <david@redhat.com>
Small nit with this one, but either way:
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> arch/x86/mm/pat/memtype.c | 14 ++++++++++++++
> include/linux/pgtable.h | 39 +++++++++++++++++++++++++++++++++++++++
> mm/memremap.c | 8 ++++----
> 3 files changed, 57 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index fa78facc6f633..1ec8af6cad6bf 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -1068,6 +1068,20 @@ int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, pgprot_t *prot
> return 0;
> }
>
> +int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot)
> +{
> + const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
Here, the << PAGE_SHIFT isn't really needed, because... (more below)
> +
> + return reserve_pfn_range(paddr, size, prot, 0);
> +}
> +
> +void pfnmap_untrack(unsigned long pfn, unsigned long size)
> +{
> + const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> +
> + free_pfn_range(paddr, size);
> +}
> +
> /*
> * untrack_pfn is called while unmapping a pfnmap for a region.
> * untrack can be called for a specific region indicated by pfn and size or
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index be1745839871c..90f72cd358390 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1502,6 +1502,16 @@ static inline int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size,
> return 0;
> }
>
> +static inline int pfnmap_track(unsigned long pfn, unsigned long size,
> + pgprot_t *prot)
> +{
> + return 0;
> +}
> +
> +static inline void pfnmap_untrack(unsigned long pfn, unsigned long size)
> +{
> +}
> +
> /*
> * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
> * tables copied during copy_page_range(). Will store the pfn to be
> @@ -1575,6 +1585,35 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> */
> int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size,
> pgprot_t *prot);
> +
> +/**
> + * pfnmap_track - track a pfn range
> + * @pfn: the start of the pfn range
> + * @size: the size of the pfn range in bytes
> + * @prot: the pgprot to track
> + *
> + * Requested the pfn range to be 'tracked' by a hardware implementation and
> + * setup the cachemode in @prot similar to pfnmap_setup_cachemode().
> + *
> + * This allows for fine-grained control of memory cache behaviour at page
> + * level granularity. Tracking memory this way is persisted across VMA splits
> + * (VMA merging does not apply for VM_PFNMAP).
> + *
> + * Currently, there is only one implementation for this - x86 Page Attribute
> + * Table (PAT). See Documentation/arch/x86/pat.rst for more details.
> + *
> + * Returns 0 on success and -EINVAL on error.
> + */
> +int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot);
> +
> +/**
> + * pfnmap_untrack - untrack a pfn range
> + * @pfn: the start of the pfn range
> + * @size: the size of the pfn range in bytes
> + *
> + * Untrack a pfn range previously tracked through pfnmap_track().
> + */
> +void pfnmap_untrack(unsigned long pfn, unsigned long size);
> extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> struct vm_area_struct *src_vma, unsigned long *pfn);
> extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 2aebc1b192da9..c417c843e9b1f 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -130,7 +130,7 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
> }
> mem_hotplug_done();
>
> - untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
> + pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
> pgmap_array_delete(range);
> }
>
> @@ -211,8 +211,8 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> if (nid < 0)
> nid = numa_mem_id();
>
> - error = track_pfn_remap(NULL, ¶ms->pgprot, PHYS_PFN(range->start), 0,
> - range_len(range));
> + error = pfnmap_track(PHYS_PFN(range->start), range_len(range),
This user (one of two) converts range->start to a pfn. The other user
is pfnmap_track_ctx_alloc() in mm/memory.c, which is called from
remap_pfn_range(), which also has the address available.
Couldn't we just use the address directly?
I think the same holds for untrack as well.
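IOW, something like this (just a sketch of the idea, not what the patch
does):

  /* sketch: take the physical address directly, avoiding the
   * PHYS_PFN() / << PAGE_SHIFT round trip for this caller */
  int pfnmap_track(resource_size_t paddr, unsigned long size,
                   pgprot_t *prot)
  {
          return reserve_pfn_range(paddr, size, prot, 0);
  }

with the caller here becoming:

  error = pfnmap_track(range->start, range_len(range),
                       &params->pgprot);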
> + ¶ms->pgprot);
> if (error)
> goto err_pfn_remap;
>
> @@ -277,7 +277,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> if (!is_private)
> kasan_remove_zero_shadow(__va(range->start), range_len(range));
> err_kasan:
> - untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
> + pfnmap_untrack(PHYS_PFN(range->start), range_len(range));
> err_pfn_remap:
> pgmap_array_delete(range);
> return error;
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 04/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()
2025-05-12 12:34 ` [PATCH v2 04/11] mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack() David Hildenbrand
2025-05-12 16:42 ` Lorenzo Stoakes
@ 2025-05-13 17:42 ` Liam R. Howlett
1 sibling, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2025-05-13 17:42 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
* David Hildenbrand <david@redhat.com> [250512 08:34]:
> Let's use our new interface. In remap_pfn_range(), we'll now decide
> whether we have to track (full VMA covered) or only lookup the
> cachemode (partial VMA covered).
>
> Remember what we have to untrack by linking it from the VMA. When
> duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similar
> to anon VMA names, and use a kref to share the tracking.
>
> Once the last VMA un-refs our tracking data, we'll do the untracking,
> which simplifies things a lot and should sort our various issues we saw
> recently, for example, when partially unmapping/zapping a tracked VMA.
>
> This change implies that we'll keep tracking the original PFN range even
> after splitting + partially unmapping it: not too bad, because it was
> not working reliably before. The only thing that kind-of worked before
> was shrinking such a mapping using mremap(): we managed to adjust the
> reservation in a hacky way, now we won't adjust the reservation but
> leave it around until all involved VMAs are gone.
>
> If that ever turns out to be an issue, we could hook into VM splitting
> code and split the tracking; however, that adds complexity that might
> not be required, so we'll keep it simple for now.
>
> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
> Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> include/linux/mm_inline.h | 2 +
> include/linux/mm_types.h | 11 ++++++
> mm/memory.c | 82 +++++++++++++++++++++++++++++++--------
> mm/mmap.c | 5 ---
> mm/mremap.c | 4 --
> mm/vma_init.c | 50 ++++++++++++++++++++++++
> 6 files changed, 129 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index f9157a0c42a5c..89b518ff097e6 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -447,6 +447,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
>
> #endif /* CONFIG_ANON_VMA_NAME */
>
> +void pfnmap_track_ctx_release(struct kref *ref);
> +
> static inline void init_tlb_flush_pending(struct mm_struct *mm)
> {
> atomic_set(&mm->tlb_flush_pending, 0);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 15808cad2bc1a..3e934dc6057c4 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -763,6 +763,14 @@ struct vma_numab_state {
> int prev_scan_seq;
> };
>
> +#ifdef __HAVE_PFNMAP_TRACKING
> +struct pfnmap_track_ctx {
> + struct kref kref;
> + unsigned long pfn;
> + unsigned long size; /* in bytes */
> +};
> +#endif
> +
> /*
> * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
> * manipulate mutable fields which will cause those fields to be updated in the
> @@ -900,6 +908,9 @@ struct vm_area_struct {
> struct anon_vma_name *anon_name;
> #endif
> struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef __HAVE_PFNMAP_TRACKING
> + struct pfnmap_track_ctx *pfnmap_track_ctx;
> +#endif
> } __randomize_layout;
>
> #ifdef CONFIG_NUMA
> diff --git a/mm/memory.c b/mm/memory.c
> index 064fc55d8eab9..4cf4adb0de266 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1371,7 +1371,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> struct mm_struct *dst_mm = dst_vma->vm_mm;
> struct mm_struct *src_mm = src_vma->vm_mm;
> struct mmu_notifier_range range;
> - unsigned long next, pfn = 0;
> + unsigned long next;
> bool is_cow;
> int ret;
>
> @@ -1381,12 +1381,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> if (is_vm_hugetlb_page(src_vma))
> return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
>
> - if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
> - ret = track_pfn_copy(dst_vma, src_vma, &pfn);
> - if (ret)
> - return ret;
> - }
> -
> /*
> * We need to invalidate the secondary MMU mappings only when
> * there could be a permission downgrade on the ptes of the
> @@ -1428,8 +1422,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> raw_write_seqcount_end(&src_mm->write_protect_seq);
> mmu_notifier_invalidate_range_end(&range);
> }
> - if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
> - untrack_pfn_copy(dst_vma, pfn);
> return ret;
> }
>
> @@ -1924,9 +1916,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> if (vma->vm_file)
> uprobe_munmap(vma, start, end);
>
> - if (unlikely(vma->vm_flags & VM_PFNMAP))
> - untrack_pfn(vma, 0, 0, mm_wr_locked);
> -
> if (start != end) {
> if (unlikely(is_vm_hugetlb_page(vma))) {
> /*
> @@ -2872,6 +2861,36 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
> return error;
> }
>
> +#ifdef __HAVE_PFNMAP_TRACKING
> +static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
> + unsigned long size, pgprot_t *prot)
> +{
> + struct pfnmap_track_ctx *ctx;
> +
> + if (pfnmap_track(pfn, size, prot))
> + return ERR_PTR(-EINVAL);
> +
> + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> + if (unlikely(!ctx)) {
> + pfnmap_untrack(pfn, size);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + ctx->pfn = pfn;
> + ctx->size = size;
> + kref_init(&ctx->kref);
> + return ctx;
> +}
> +
> +void pfnmap_track_ctx_release(struct kref *ref)
> +{
> + struct pfnmap_track_ctx *ctx = container_of(ref, struct pfnmap_track_ctx, kref);
> +
> + pfnmap_untrack(ctx->pfn, ctx->size);
> + kfree(ctx);
> +}
> +#endif /* __HAVE_PFNMAP_TRACKING */
> +
> /**
> * remap_pfn_range - remap kernel memory to userspace
> * @vma: user vma to map to
> @@ -2884,20 +2903,51 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
> *
> * Return: %0 on success, negative error code otherwise.
> */
> +#ifdef __HAVE_PFNMAP_TRACKING
> int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> unsigned long pfn, unsigned long size, pgprot_t prot)
> {
> + struct pfnmap_track_ctx *ctx = NULL;
> int err;
>
> - err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
> - if (err)
> + size = PAGE_ALIGN(size);
> +
> + /*
> + * If we cover the full VMA, we'll perform actual tracking, and
> + * remember to untrack when the last reference to our tracking
> + * context from a VMA goes away. We'll keep tracking the whole pfn
> + * range even during VMA splits and partial unmapping.
> + *
> + * If we only cover parts of the VMA, we'll only setup the cachemode
> + * in the pgprot for the pfn range.
> + */
> + if (addr == vma->vm_start && addr + size == vma->vm_end) {
> + if (vma->pfnmap_track_ctx)
> + return -EINVAL;
> + ctx = pfnmap_track_ctx_alloc(pfn, size, &prot);
> + if (IS_ERR(ctx))
> + return PTR_ERR(ctx);
> + } else if (pfnmap_setup_cachemode(pfn, size, &prot)) {
> return -EINVAL;
> + }
>
> err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> - if (err)
> - untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
> + if (ctx) {
> + if (err)
> + kref_put(&ctx->kref, pfnmap_track_ctx_release);
> + else
> + vma->pfnmap_track_ctx = ctx;
> + }
> return err;
> }
> +
> +#else
> +int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> + unsigned long pfn, unsigned long size, pgprot_t prot)
> +{
> + return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
> +}
> +#endif
> EXPORT_SYMBOL(remap_pfn_range);
>
> /**
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 50f902c08341a..09c563c951123 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1784,11 +1784,6 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> tmp = vm_area_dup(mpnt);
> if (!tmp)
> goto fail_nomem;
> -
> - /* track_pfn_copy() will later take care of copying internal state. */
> - if (unlikely(tmp->vm_flags & VM_PFNMAP))
> - untrack_pfn_clear(tmp);
> -
> retval = vma_dup_policy(mpnt, tmp);
> if (retval)
> goto fail_nomem_policy;
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 7db9da609c84f..6e78e02f74bd3 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -1191,10 +1191,6 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
> if (is_vm_hugetlb_page(vma))
> clear_vma_resv_huge_pages(vma);
>
> - /* Tell pfnmap has moved from this vma */
> - if (unlikely(vma->vm_flags & VM_PFNMAP))
> - untrack_pfn_clear(vma);
> -
> *new_vma_ptr = new_vma;
> return err;
> }
> diff --git a/mm/vma_init.c b/mm/vma_init.c
> index 967ca85179864..8e53c7943561e 100644
> --- a/mm/vma_init.c
> +++ b/mm/vma_init.c
> @@ -71,7 +71,51 @@ static void vm_area_init_from(const struct vm_area_struct *src,
> #ifdef CONFIG_NUMA
> dest->vm_policy = src->vm_policy;
> #endif
> +#ifdef __HAVE_PFNMAP_TRACKING
> + dest->pfnmap_track_ctx = NULL;
> +#endif
> +}
> +
> +#ifdef __HAVE_PFNMAP_TRACKING
> +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> + struct vm_area_struct *new)
> +{
> + struct pfnmap_track_ctx *ctx = orig->pfnmap_track_ctx;
> +
> + if (likely(!ctx))
> + return 0;
> +
> + /*
> + * We don't expect to ever hit this. If ever required, we would have
> + * to duplicate the tracking.
> + */
> + if (unlikely(kref_read(&ctx->kref) >= REFCOUNT_MAX))
> + return -ENOMEM;
> + kref_get(&ctx->kref);
> + new->pfnmap_track_ctx = ctx;
> + return 0;
> +}
> +
> +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
> +{
> + struct pfnmap_track_ctx *ctx = vma->pfnmap_track_ctx;
> +
> + if (likely(!ctx))
> + return;
> +
> + kref_put(&ctx->kref, pfnmap_track_ctx_release);
> + vma->pfnmap_track_ctx = NULL;
> +}
> +#else
> +static inline int vma_pfnmap_track_ctx_dup(struct vm_area_struct *orig,
> + struct vm_area_struct *new)
> +{
> + return 0;
> }
> +static inline void vma_pfnmap_track_ctx_release(struct vm_area_struct *vma)
> +{
> +}
> +#endif
>
> struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> {
> @@ -83,6 +127,11 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
> ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
> vm_area_init_from(orig, new);
> +
> + if (vma_pfnmap_track_ctx_dup(orig, new)) {
> + kmem_cache_free(vm_area_cachep, new);
> + return NULL;
> + }
> vma_lock_init(new, true);
> INIT_LIST_HEAD(&new->anon_vma_chain);
> vma_numab_state_init(new);
> @@ -97,5 +146,6 @@ void vm_area_free(struct vm_area_struct *vma)
> vma_assert_detached(vma);
> vma_numab_state_free(vma);
> free_anon_vma_name(vma);
> + vma_pfnmap_track_ctx_release(vma);
> kmem_cache_free(vm_area_cachep, vma);
> }
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 05/11] x86/mm/pat: remove old pfnmap tracking interface
2025-05-12 12:34 ` [PATCH v2 05/11] x86/mm/pat: remove old pfnmap tracking interface David Hildenbrand
@ 2025-05-13 17:42 ` Liam R. Howlett
0 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2025-05-13 17:42 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
* David Hildenbrand <david@redhat.com> [250512 08:34]:
> We can now get rid of the old interface along with get_pat_info() and
> follow_phys().
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
> Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> arch/x86/mm/pat/memtype.c | 147 --------------------------------------
> include/linux/pgtable.h | 66 -----------------
> 2 files changed, 213 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index 1ec8af6cad6bf..c88d1cbdc1de1 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -933,119 +933,6 @@ static void free_pfn_range(u64 paddr, unsigned long size)
> memtype_free(paddr, paddr + size);
> }
>
> -static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
> - resource_size_t *phys)
> -{
> - struct follow_pfnmap_args args = { .vma = vma, .address = vma->vm_start };
> -
> - if (follow_pfnmap_start(&args))
> - return -EINVAL;
> -
> - /* Never return PFNs of anon folios in COW mappings. */
> - if (!args.special) {
> - follow_pfnmap_end(&args);
> - return -EINVAL;
> - }
> -
> - *prot = pgprot_val(args.pgprot);
> - *phys = (resource_size_t)args.pfn << PAGE_SHIFT;
> - follow_pfnmap_end(&args);
> - return 0;
> -}
> -
> -static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
> - pgprot_t *pgprot)
> -{
> - unsigned long prot;
> -
> - VM_WARN_ON_ONCE(!(vma->vm_flags & VM_PAT));
> -
> - /*
> - * We need the starting PFN and cachemode used for track_pfn_remap()
> - * that covered the whole VMA. For most mappings, we can obtain that
> - * information from the page tables. For COW mappings, we might now
> - * suddenly have anon folios mapped and follow_phys() will fail.
> - *
> - * Fallback to using vma->vm_pgoff, see remap_pfn_range_notrack(), to
> - * detect the PFN. If we need the cachemode as well, we're out of luck
> - * for now and have to fail fork().
> - */
> - if (!follow_phys(vma, &prot, paddr)) {
> - if (pgprot)
> - *pgprot = __pgprot(prot);
> - return 0;
> - }
> - if (is_cow_mapping(vma->vm_flags)) {
> - if (pgprot)
> - return -EINVAL;
> - *paddr = (resource_size_t)vma->vm_pgoff << PAGE_SHIFT;
> - return 0;
> - }
> - WARN_ON_ONCE(1);
> - return -EINVAL;
> -}
> -
> -int track_pfn_copy(struct vm_area_struct *dst_vma,
> - struct vm_area_struct *src_vma, unsigned long *pfn)
> -{
> - const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
> - resource_size_t paddr;
> - pgprot_t pgprot;
> - int rc;
> -
> - if (!(src_vma->vm_flags & VM_PAT))
> - return 0;
> -
> - /*
> - * Duplicate the PAT information for the dst VMA based on the src
> - * VMA.
> - */
> - if (get_pat_info(src_vma, &paddr, &pgprot))
> - return -EINVAL;
> - rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
> - if (rc)
> - return rc;
> -
> - /* Reservation for the destination VMA succeeded. */
> - vm_flags_set(dst_vma, VM_PAT);
> - *pfn = PHYS_PFN(paddr);
> - return 0;
> -}
> -
> -void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn)
> -{
> - untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
> - /*
> - * Reservation was freed, any copied page tables will get cleaned
> - * up later, but without getting PAT involved again.
> - */
> -}
> -
> -/*
> - * prot is passed in as a parameter for the new mapping. If the vma has
> - * a linear pfn mapping for the entire range, or no vma is provided,
> - * reserve the entire pfn + size range with single reserve_pfn_range
> - * call.
> - */
> -int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> - unsigned long pfn, unsigned long addr, unsigned long size)
> -{
> - resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> -
> - /* reserve the whole chunk starting from paddr */
> - if (!vma || (addr == vma->vm_start
> - && size == (vma->vm_end - vma->vm_start))) {
> - int ret;
> -
> - ret = reserve_pfn_range(paddr, size, prot, 0);
> - if (ret == 0 && vma)
> - vm_flags_set(vma, VM_PAT);
> - return ret;
> - }
> -
> - return pfnmap_setup_cachemode(pfn, size, prot);
> -}
> -
> int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size, pgprot_t *prot)
> {
> resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> @@ -1082,40 +969,6 @@ void pfnmap_untrack(unsigned long pfn, unsigned long size)
> free_pfn_range(paddr, size);
> }
>
> -/*
> - * untrack_pfn is called while unmapping a pfnmap for a region.
> - * untrack can be called for a specific region indicated by pfn and size or
> - * can be for the entire vma (in which case pfn, size are zero).
> - */
> -void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> - unsigned long size, bool mm_wr_locked)
> -{
> - resource_size_t paddr;
> -
> - if (vma && !(vma->vm_flags & VM_PAT))
> - return;
> -
> - /* free the chunk starting from pfn or the whole chunk */
> - paddr = (resource_size_t)pfn << PAGE_SHIFT;
> - if (!paddr && !size) {
> - if (get_pat_info(vma, &paddr, NULL))
> - return;
> - size = vma->vm_end - vma->vm_start;
> - }
> - free_pfn_range(paddr, size);
> - if (vma) {
> - if (mm_wr_locked)
> - vm_flags_clear(vma, VM_PAT);
> - else
> - __vm_flags_mod(vma, 0, VM_PAT);
> - }
> -}
> -
> -void untrack_pfn_clear(struct vm_area_struct *vma)
> -{
> - vm_flags_clear(vma, VM_PAT);
> -}
> -
> pgprot_t pgprot_writecombine(pgprot_t prot)
> {
> pgprot_set_cachemode(&prot, _PAGE_CACHE_MODE_WC);
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 90f72cd358390..0b6e1f781d86d 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1485,17 +1485,6 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
> * vmf_insert_pfn.
> */
>
> -/*
> - * track_pfn_remap is called when a _new_ pfn mapping is being established
> - * by remap_pfn_range() for physical range indicated by pfn and size.
> - */
> -static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> - unsigned long pfn, unsigned long addr,
> - unsigned long size)
> -{
> - return 0;
> -}
> -
> static inline int pfnmap_setup_cachemode(unsigned long pfn, unsigned long size,
> pgprot_t *prot)
> {
> @@ -1511,55 +1500,7 @@ static inline int pfnmap_track(unsigned long pfn, unsigned long size,
> static inline void pfnmap_untrack(unsigned long pfn, unsigned long size)
> {
> }
> -
> -/*
> - * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
> - * tables copied during copy_page_range(). Will store the pfn to be
> - * passed to untrack_pfn_copy() only if there is something to be untracked.
> - * Callers should initialize the pfn to 0.
> - */
> -static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
> - struct vm_area_struct *src_vma, unsigned long *pfn)
> -{
> - return 0;
> -}
> -
> -/*
> - * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
> - * copy_page_range(), but after track_pfn_copy() was already called. Can
> - * be called even if track_pfn_copy() did not actually track anything:
> - * handled internally.
> - */
> -static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> - unsigned long pfn)
> -{
> -}
> -
> -/*
> - * untrack_pfn is called while unmapping a pfnmap for a region.
> - * untrack can be called for a specific region indicated by pfn and size or
> - * can be for the entire vma (in which case pfn, size are zero).
> - */
> -static inline void untrack_pfn(struct vm_area_struct *vma,
> - unsigned long pfn, unsigned long size,
> - bool mm_wr_locked)
> -{
> -}
> -
> -/*
> - * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
> - *
> - * 1) During mremap() on the src VMA after the page tables were moved.
> - * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
> - */
> -static inline void untrack_pfn_clear(struct vm_area_struct *vma)
> -{
> -}
> #else
> -extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> - unsigned long pfn, unsigned long addr,
> - unsigned long size);
> -
> /**
> * pfnmap_setup_cachemode - setup the cachemode in the pgprot for a pfn range
> * @pfn: the start of the pfn range
> @@ -1614,13 +1555,6 @@ int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot);
> * Untrack a pfn range previously tracked through pfnmap_track().
> */
> void pfnmap_untrack(unsigned long pfn, unsigned long size);
> -extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> - struct vm_area_struct *src_vma, unsigned long *pfn);
> -extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> - unsigned long pfn);
> -extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> - unsigned long size, bool mm_wr_locked);
> -extern void untrack_pfn_clear(struct vm_area_struct *vma);
> #endif
>
> /**
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v2 06/11] mm: remove VM_PAT
2025-05-12 12:34 ` [PATCH v2 06/11] mm: remove VM_PAT David Hildenbrand
@ 2025-05-13 17:42 ` Liam R. Howlett
0 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2025-05-13 17:42 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
* David Hildenbrand <david@redhat.com> [250512 08:34]:
> It's unused, so let's remove it.
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
> Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> include/linux/mm.h | 4 +---
> include/trace/events/mmflags.h | 4 +---
> 2 files changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 38e16c984b9a6..c4efa9b17655e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -357,9 +357,7 @@ extern unsigned int kobjsize(const void *objp);
> # define VM_SHADOW_STACK VM_NONE
> #endif
>
> -#if defined(CONFIG_X86)
> -# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
> -#elif defined(CONFIG_PPC64)
> +#if defined(CONFIG_PPC64)
> # define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */
> #elif defined(CONFIG_PARISC)
> # define VM_GROWSUP VM_ARCH_1
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 15aae955a10bf..aa441f593e9a6 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -172,9 +172,7 @@ IF_HAVE_PG_ARCH_3(arch_3)
> __def_pageflag_names \
> ) : "none"
>
> -#if defined(CONFIG_X86)
> -#define __VM_ARCH_SPECIFIC_1 {VM_PAT, "pat" }
> -#elif defined(CONFIG_PPC64)
> +#if defined(CONFIG_PPC64)
> #define __VM_ARCH_SPECIFIC_1 {VM_SAO, "sao" }
> #elif defined(CONFIG_PARISC)
> #define __VM_ARCH_SPECIFIC_1 {VM_GROWSUP, "growsup" }
> --
> 2.49.0
>
* Re: [PATCH v2 07/11] x86/mm/pat: remove strict_prot parameter from reserve_pfn_range()
2025-05-12 12:34 ` [PATCH v2 07/11] x86/mm/pat: remove strict_prot parameter from reserve_pfn_range() David Hildenbrand
@ 2025-05-13 17:43 ` Liam R. Howlett
0 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2025-05-13 17:43 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
* David Hildenbrand <david@redhat.com> [250512 08:34]:
> Always set to 0, so let's remove it.
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
> Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> arch/x86/mm/pat/memtype.c | 12 +++---------
> 1 file changed, 3 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index c88d1cbdc1de1..ccc55c00b4c8b 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -858,8 +858,7 @@ int memtype_kernel_map_sync(u64 base, unsigned long size,
> * Reserved non RAM regions only and after successful memtype_reserve,
> * this func also keeps identity mapping (if any) in sync with this new prot.
> */
> -static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
> - int strict_prot)
> +static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot)
> {
> int is_ram = 0;
> int ret;
> @@ -895,8 +894,7 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
> return ret;
>
> if (pcm != want_pcm) {
> - if (strict_prot ||
> - !is_new_memtype_allowed(paddr, size, want_pcm, pcm)) {
> + if (!is_new_memtype_allowed(paddr, size, want_pcm, pcm)) {
> memtype_free(paddr, paddr + size);
> pr_err("x86/PAT: %s:%d map pfn expected mapping type %s for [mem %#010Lx-%#010Lx], got %s\n",
> current->comm, current->pid,
> @@ -906,10 +904,6 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
> cattr_name(pcm));
> return -EINVAL;
> }
> - /*
> - * We allow returning different type than the one requested in
> - * non strict case.
> - */
> pgprot_set_cachemode(vma_prot, pcm);
> }
>
> @@ -959,7 +953,7 @@ int pfnmap_track(unsigned long pfn, unsigned long size, pgprot_t *prot)
> {
> const resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
>
> - return reserve_pfn_range(paddr, size, prot, 0);
> + return reserve_pfn_range(paddr, size, prot);
> }
>
> void pfnmap_untrack(unsigned long pfn, unsigned long size)
> --
> 2.49.0
>
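With strict_prot gone, the control flow of reserve_pfn_range() reduces to
roughly the following. This is a condensed sketch of the function after this
patch (the is_ram handling is elided), not the verbatim code:

	want_pcm = pgprot2cachemode(*vma_prot);
	ret = memtype_reserve(paddr, paddr + size, want_pcm, &pcm);
	if (ret)
		return ret;

	if (pcm != want_pcm) {
		if (!is_new_memtype_allowed(paddr, size, want_pcm, pcm)) {
			memtype_free(paddr, paddr + size);
			return -EINVAL;
		}
		/* fall back to the already-reserved, compatible cachemode */
		pgprot_set_cachemode(vma_prot, pcm);
	}
	return 0;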
* Re: [PATCH v2 08/11] x86/mm/pat: remove MEMTYPE_*_MATCH
2025-05-12 12:34 ` [PATCH v2 08/11] x86/mm/pat: remove MEMTYPE_*_MATCH David Hildenbrand
@ 2025-05-13 17:48 ` Liam R. Howlett
2025-05-14 17:53 ` David Hildenbrand
0 siblings, 1 reply; 36+ messages in thread
From: Liam R. Howlett @ 2025-05-13 17:48 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
* David Hildenbrand <david@redhat.com> [250512 08:34]:
> The "memramp() shrinking" scenario no longer applies, so let's remove
> that now-unnecessary handling.
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
> Signed-off-by: David Hildenbrand <david@redhat.com>
small comment, but this looks good.
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> arch/x86/mm/pat/memtype_interval.c | 44 ++++--------------------------
> 1 file changed, 6 insertions(+), 38 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype_interval.c b/arch/x86/mm/pat/memtype_interval.c
> index 645613d59942a..9d03f0dbc4715 100644
> --- a/arch/x86/mm/pat/memtype_interval.c
> +++ b/arch/x86/mm/pat/memtype_interval.c
> @@ -49,26 +49,15 @@ INTERVAL_TREE_DEFINE(struct memtype, rb, u64, subtree_max_end,
>
> static struct rb_root_cached memtype_rbroot = RB_ROOT_CACHED;
>
> -enum {
> - MEMTYPE_EXACT_MATCH = 0,
> - MEMTYPE_END_MATCH = 1
> -};
> -
> -static struct memtype *memtype_match(u64 start, u64 end, int match_type)
> +static struct memtype *memtype_match(u64 start, u64 end)
> {
> struct memtype *entry_match;
>
> entry_match = interval_iter_first(&memtype_rbroot, start, end-1);
>
> while (entry_match != NULL && entry_match->start < end) {
I think this could use interval_tree_for_each_span() instead.
> - if ((match_type == MEMTYPE_EXACT_MATCH) &&
> - (entry_match->start == start) && (entry_match->end == end))
> - return entry_match;
> -
> - if ((match_type == MEMTYPE_END_MATCH) &&
> - (entry_match->start < start) && (entry_match->end == end))
> + if (entry_match->start == start && entry_match->end == end)
> return entry_match;
> -
> entry_match = interval_iter_next(entry_match, start, end-1);
> }
>
> @@ -132,32 +121,11 @@ struct memtype *memtype_erase(u64 start, u64 end)
> {
> struct memtype *entry_old;
>
> - /*
> - * Since the memtype_rbroot tree allows overlapping ranges,
> - * memtype_erase() checks with EXACT_MATCH first, i.e. free
> - * a whole node for the munmap case. If no such entry is found,
> - * it then checks with END_MATCH, i.e. shrink the size of a node
> - * from the end for the mremap case.
> - */
> - entry_old = memtype_match(start, end, MEMTYPE_EXACT_MATCH);
> - if (!entry_old) {
> - entry_old = memtype_match(start, end, MEMTYPE_END_MATCH);
> - if (!entry_old)
> - return ERR_PTR(-EINVAL);
> - }
> -
> - if (entry_old->start == start) {
> - /* munmap: erase this node */
> - interval_remove(entry_old, &memtype_rbroot);
> - } else {
> - /* mremap: update the end value of this node */
> - interval_remove(entry_old, &memtype_rbroot);
> - entry_old->end = start;
> - interval_insert(entry_old, &memtype_rbroot);
> -
> - return NULL;
> - }
> + entry_old = memtype_match(start, end);
> + if (!entry_old)
> + return ERR_PTR(-EINVAL);
>
> + interval_remove(entry_old, &memtype_rbroot);
> return entry_old;
> }
>
> --
> 2.49.0
>
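For reference, the iteration suggested above looks roughly like this with the
generic span API (a sketch based on the interval_tree.h helpers as used by
iommufd, with itree assumed to be a struct rb_root_cached; whether it applies
to this particular tree is discussed further down the thread):

	struct interval_tree_span_iter span;

	interval_tree_for_each_span(&span, &itree, start, end - 1) {
		if (span.is_hole)
			continue;
		/* [span.start_used, span.last_used] is covered by entries */
	}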
* Re: [PATCH v2 09/11] x86/mm/pat: inline memtype_match() into memtype_erase()
2025-05-12 12:34 ` [PATCH v2 09/11] x86/mm/pat: inline memtype_match() into memtype_erase() David Hildenbrand
2025-05-12 16:49 ` Lorenzo Stoakes
@ 2025-05-13 17:49 ` Liam R. Howlett
1 sibling, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2025-05-13 17:49 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu
* David Hildenbrand <david@redhat.com> [250512 08:34]:
> Let's just have it in a single function. The resulting function is
> certainly small enough and readable.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Same comment about interval_tree_for_each_span() here, I guess.
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> arch/x86/mm/pat/memtype_interval.c | 33 +++++++++---------------------
> 1 file changed, 10 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype_interval.c b/arch/x86/mm/pat/memtype_interval.c
> index 9d03f0dbc4715..e5844ed1311ed 100644
> --- a/arch/x86/mm/pat/memtype_interval.c
> +++ b/arch/x86/mm/pat/memtype_interval.c
> @@ -49,21 +49,6 @@ INTERVAL_TREE_DEFINE(struct memtype, rb, u64, subtree_max_end,
>
> static struct rb_root_cached memtype_rbroot = RB_ROOT_CACHED;
>
> -static struct memtype *memtype_match(u64 start, u64 end)
> -{
> - struct memtype *entry_match;
> -
> - entry_match = interval_iter_first(&memtype_rbroot, start, end-1);
> -
> - while (entry_match != NULL && entry_match->start < end) {
> - if (entry_match->start == start && entry_match->end == end)
> - return entry_match;
> - entry_match = interval_iter_next(entry_match, start, end-1);
> - }
> -
> - return NULL; /* Returns NULL if there is no match */
> -}
> -
> static int memtype_check_conflict(u64 start, u64 end,
> enum page_cache_mode reqtype,
> enum page_cache_mode *newtype)
> @@ -119,14 +104,16 @@ int memtype_check_insert(struct memtype *entry_new, enum page_cache_mode *ret_ty
>
> struct memtype *memtype_erase(u64 start, u64 end)
> {
> - struct memtype *entry_old;
> -
> - entry_old = memtype_match(start, end);
> - if (!entry_old)
> - return ERR_PTR(-EINVAL);
> -
> - interval_remove(entry_old, &memtype_rbroot);
> - return entry_old;
> + struct memtype *entry = interval_iter_first(&memtype_rbroot, start, end - 1);
> +
> + while (entry && entry->start < end) {
> + if (entry->start == start && entry->end == end) {
> + interval_remove(entry, &memtype_rbroot);
> + return entry;
> + }
> + entry = interval_iter_next(entry, start, end - 1);
> + }
> + return ERR_PTR(-EINVAL);
> }
>
> struct memtype *memtype_lookup(u64 addr)
> --
> 2.49.0
>
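The contract of the resulting memtype_erase() is simple: the exact-match entry
is removed and returned for the caller to free, and ERR_PTR(-EINVAL) signals
that nothing matched. Its single caller uses it along these lines (a sketch
going from memory of memtype.c, so treat the locking details as assumptions):

	spin_lock(&memtype_lock);
	entry = memtype_erase(start, end);
	spin_unlock(&memtype_lock);

	if (IS_ERR(entry))
		return -EINVAL;		/* nothing tracked for [start, end) */
	kfree(entry);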
* Re: [PATCH v2 10/11] drm/i915: track_pfn() -> "pfnmap tracking"
2025-05-12 12:34 ` [PATCH v2 10/11] drm/i915: track_pfn() -> "pfnmap tracking" David Hildenbrand
@ 2025-05-13 17:50 ` Liam R. Howlett
0 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2025-05-13 17:50 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
* David Hildenbrand <david@redhat.com> [250512 08:35]:
> track_pfn() does not exist, let's simply refer to it as "pfnmap
> tracking".
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
> Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> drivers/gpu/drm/i915/i915_mm.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_mm.c b/drivers/gpu/drm/i915/i915_mm.c
> index 76e2801619f09..c33bd3d830699 100644
> --- a/drivers/gpu/drm/i915/i915_mm.c
> +++ b/drivers/gpu/drm/i915/i915_mm.c
> @@ -100,7 +100,7 @@ int remap_io_mapping(struct vm_area_struct *vma,
>
> GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS);
>
> - /* We rely on prevalidation of the io-mapping to skip track_pfn(). */
> + /* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
> r.mm = vma->vm_mm;
> r.pfn = pfn;
> r.prot = __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) |
> @@ -140,7 +140,7 @@ int remap_io_sg(struct vm_area_struct *vma,
> };
> int err;
>
> - /* We rely on prevalidation of the io-mapping to skip track_pfn(). */
> + /* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
> GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS);
>
> while (offset >= r.sgt.max >> PAGE_SHIFT) {
> --
> 2.49.0
>
* Re: [PATCH v2 11/11] mm/io-mapping: track_pfn() -> "pfnmap tracking"
2025-05-12 12:34 ` [PATCH v2 11/11] mm/io-mapping: " David Hildenbrand
@ 2025-05-13 17:50 ` Liam R. Howlett
0 siblings, 0 replies; 36+ messages in thread
From: Liam R. Howlett @ 2025-05-13 17:50 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, x86, intel-gfx, dri-devel,
linux-trace-kernel, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
David Airlie, Simona Vetter, Andrew Morton, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Pedro Falcato, Peter Xu, Ingo Molnar
* David Hildenbrand <david@redhat.com> [250512 08:35]:
> track_pfn() does not exist, let's simply refer to it as "pfnmap
> tracking".
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
> Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/io-mapping.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/io-mapping.c b/mm/io-mapping.c
> index f44a6a1347123..d3586e95c12c5 100644
> --- a/mm/io-mapping.c
> +++ b/mm/io-mapping.c
> @@ -24,7 +24,7 @@ int io_mapping_map_user(struct io_mapping *iomap, struct vm_area_struct *vma,
> pgprot_t remap_prot = __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) |
> (pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK));
>
> - /* We rely on prevalidation of the io-mapping to skip track_pfn(). */
> + /* We rely on prevalidation of the io-mapping to skip pfnmap tracking. */
> return remap_pfn_range_notrack(vma, addr, pfn, size, remap_prot);
> }
> EXPORT_SYMBOL_GPL(io_mapping_map_user);
> --
> 2.49.0
>
* Re: [PATCH v2 08/11] x86/mm/pat: remove MEMTYPE_*_MATCH
2025-05-13 17:48 ` Liam R. Howlett
@ 2025-05-14 17:53 ` David Hildenbrand
2025-05-15 14:10 ` David Hildenbrand
0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2025-05-14 17:53 UTC (permalink / raw)
To: Liam R. Howlett, linux-kernel, linux-mm, x86, intel-gfx,
dri-devel, linux-trace-kernel, Dave Hansen, Andy Lutomirski,
Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
H. Peter Anvin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
Tvrtko Ursulin, David Airlie, Simona Vetter, Andrew Morton,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu, Ingo Molnar
On 13.05.25 19:48, Liam R. Howlett wrote:
> * David Hildenbrand <david@redhat.com> [250512 08:34]:
>> The "memramp() shrinking" scenario no longer applies, so let's remove
>> that now-unnecessary handling.
>>
>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> small comment, but this looks good.
>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Thanks!
>
>> ---
>> arch/x86/mm/pat/memtype_interval.c | 44 ++++--------------------------
>> 1 file changed, 6 insertions(+), 38 deletions(-)
>>
>> diff --git a/arch/x86/mm/pat/memtype_interval.c b/arch/x86/mm/pat/memtype_interval.c
>> index 645613d59942a..9d03f0dbc4715 100644
>> --- a/arch/x86/mm/pat/memtype_interval.c
>> +++ b/arch/x86/mm/pat/memtype_interval.c
>> @@ -49,26 +49,15 @@ INTERVAL_TREE_DEFINE(struct memtype, rb, u64, subtree_max_end,
>>
>> static struct rb_root_cached memtype_rbroot = RB_ROOT_CACHED;
>>
>> -enum {
>> - MEMTYPE_EXACT_MATCH = 0,
>> - MEMTYPE_END_MATCH = 1
>> -};
>> -
>> -static struct memtype *memtype_match(u64 start, u64 end, int match_type)
>> +static struct memtype *memtype_match(u64 start, u64 end)
>> {
>> struct memtype *entry_match;
>>
>> entry_match = interval_iter_first(&memtype_rbroot, start, end-1);
>>
>> while (entry_match != NULL && entry_match->start < end) {
>
> I think this could use interval_tree_for_each_span() instead.
Fancy, let me look at this. Probably I'll send another patch on top of
this series to do that conversion. (as you found, patch #9 moves that code)
--
Cheers,
David / dhildenb
* Re: [PATCH v2 03/11] mm: introduce pfnmap_track() and pfnmap_untrack() and use them for memremap
2025-05-13 17:40 ` Liam R. Howlett
@ 2025-05-14 17:57 ` David Hildenbrand
0 siblings, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-05-14 17:57 UTC (permalink / raw)
To: Liam R. Howlett, linux-kernel, linux-mm, x86, intel-gfx,
dri-devel, linux-trace-kernel, Dave Hansen, Andy Lutomirski,
Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
H. Peter Anvin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
Tvrtko Ursulin, David Airlie, Simona Vetter, Andrew Morton,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu, Ingo Molnar
On 13.05.25 19:40, Liam R. Howlett wrote:
> * David Hildenbrand <david@redhat.com> [250512 08:34]:
>> Let's provide variants of track_pfn_remap() and untrack_pfn() that won't
>> mess with VMAs, and replace the usage in mm/memremap.c.
>>
>> Add some documentation.
>>
>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> Small nit with this one, but either way:
>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Thanks!
[...]
>
> The other user is pfnmap_track_ctx_alloc() in mm/memory.c which is
> called from remap_pfn_range(), which also has addr.
>
> Couldn't we just use the address directly?
>
> I think the same holds for untrack as well.
Hm, conceptually, I want the "pfntrack" interface to consume ... PFNs :)
Actually, I was thinking about converting the "size" parameter to
nr_pages as well, but decided to leave that for another day.
... because I really should be working on (... checking todo list ...)
anything else but PAT at this point.
So unless there are strong feelings, I'll leave it that way (the way the
old interface also used it), and add it to my todo list (either make it
an address or make size -> nr_pages).
Thanks for all the review Liam!
--
Cheers,
David / dhildenb
* Re: [PATCH v2 08/11] x86/mm/pat: remove MEMTYPE_*_MATCH
2025-05-14 17:53 ` David Hildenbrand
@ 2025-05-15 14:10 ` David Hildenbrand
0 siblings, 0 replies; 36+ messages in thread
From: David Hildenbrand @ 2025-05-15 14:10 UTC (permalink / raw)
To: Liam R. Howlett, linux-kernel, linux-mm, x86, intel-gfx,
dri-devel, linux-trace-kernel, Dave Hansen, Andy Lutomirski,
Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
H. Peter Anvin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
Tvrtko Ursulin, David Airlie, Simona Vetter, Andrew Morton,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
Peter Xu, Ingo Molnar
On 14.05.25 19:53, David Hildenbrand wrote:
> On 13.05.25 19:48, Liam R. Howlett wrote:
>> * David Hildenbrand <david@redhat.com> [250512 08:34]:
>>> The "memramp() shrinking" scenario no longer applies, so let's remove
>>> that now-unnecessary handling.
>>>
>>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 bits
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> small comment, but this looks good.
>>
>> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
>
> Thanks!
>
>>
>>> ---
>>> arch/x86/mm/pat/memtype_interval.c | 44 ++++--------------------------
>>> 1 file changed, 6 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/arch/x86/mm/pat/memtype_interval.c b/arch/x86/mm/pat/memtype_interval.c
>>> index 645613d59942a..9d03f0dbc4715 100644
>>> --- a/arch/x86/mm/pat/memtype_interval.c
>>> +++ b/arch/x86/mm/pat/memtype_interval.c
>>> @@ -49,26 +49,15 @@ INTERVAL_TREE_DEFINE(struct memtype, rb, u64, subtree_max_end,
>>>
>>> static struct rb_root_cached memtype_rbroot = RB_ROOT_CACHED;
>>>
>>> -enum {
>>> - MEMTYPE_EXACT_MATCH = 0,
>>> - MEMTYPE_END_MATCH = 1
>>> -};
>>> -
>>> -static struct memtype *memtype_match(u64 start, u64 end, int match_type)
>>> +static struct memtype *memtype_match(u64 start, u64 end)
>>> {
>>> struct memtype *entry_match;
>>>
>>> entry_match = interval_iter_first(&memtype_rbroot, start, end-1);
>>>
>>> while (entry_match != NULL && entry_match->start < end) {
>>
>> I think this could use interval_tree_for_each_span() instead.
>
> Fancy, let me look at this. Probably I'll send another patch on top of
> this series to do that conversion. (as you found, patch #9 moves that code)
Hmmm, I think interval_tree_for_each_span() does not apply here.
Unless I am missing something important, interval_tree_for_each_span()
does not work in combination with INTERVAL_TREE_DEFINE where we want to
use a custom type as tree nodes (-> struct memtype).
interval_tree_for_each_span() only works with the basic "struct
interval_tree_node" implementation ... which is probably also why there
are only a handful (3) of interval_tree_for_each_span() users, all in
iommufd context?
But staring at interval_tree.h vs. interval_tree_generic.h, I am a bit
confused ...
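A sketch of the mismatch (identifier names from interval_tree.h and
memtype_interval.c as best recalled, so treat them as assumptions):

	/* Generic flavour: fixed node type; this is what the span iterator walks. */
	struct interval_tree_node node = { .start = start, .last = end - 1 };
	interval_tree_insert(&node, &root);	/* root: struct rb_root_cached */

	/*
	 * Custom flavour: any struct can be a node, but you only get the
	 * generated iterators (interval_iter_first/next here) -- no spans.
	 */
	INTERVAL_TREE_DEFINE(struct memtype, rb, u64, subtree_max_end,
			     interval_start, interval_end,	/* accessor names assumed */
			     static, interval)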
--
Cheers,
David / dhildenb