nvdimm.lists.linux.dev archive mirror
* [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes
@ 2025-06-11 12:06 David Hildenbrand
  2025-06-11 12:06 ` [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud() David Hildenbrand
                   ` (5 more replies)
  0 siblings, 6 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-06-11 12:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, nvdimm, linux-cxl, David Hildenbrand, Andrew Morton,
	Alistair Popple, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Dan Williams, Oscar Salvador

This is v2 of
	"[PATCH v1 0/2] mm/huge_memory: don't mark refcounted pages special
	 in vmf_insert_folio_*()"
Now with one additional fix, based on mm/mm-unstable.

While working on improving vm_normal_page() and friends, I stumbled
over these issues: refcounted "normal" pages must not be marked
using pmd_special() / pud_special().

Fortunately, so far there doesn't seem to be serious damage.
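
As a userspace illustration of why this rule matters (mock types and
simplified names, not actual kernel code): vm_normal_page()-style lookups
treat "special" entries as having no struct page/folio the caller may
touch, so a refcounted folio wrongly marked special simply becomes
invisible to such lookups:

```c
/*
 * Userspace sketch (mock types, simplified names -- not kernel code) of the
 * rule being violated: vm_normal_page()-style lookups must treat "special"
 * entries as having no refcounted folio that callers may touch.
 */
#include <stdbool.h>
#include <stddef.h>

struct mock_folio {
	int refcount;
};

struct mock_pmd {
	struct mock_folio *folio;	/* backing folio, if any */
	bool special;			/* stand-in for the pmd_special() bit */
};

/*
 * Mimics vm_normal_folio_pmd(): for special mappings there is, by contract,
 * no refcounted folio to return -- so a "normal" refcounted folio that was
 * wrongly marked special becomes invisible here.
 */
static struct mock_folio *normal_folio_of(const struct mock_pmd *pmd)
{
	if (pmd->special)
		return NULL;	/* caller must not take a reference */
	return pmd->folio;
}
```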

I spent too much time trying to get the ndctl tests mentioned by Dan
running (.config tweaks, memmap= setup, ...), without getting them to
pass even without these patches. Some SKIP, some FAIL, some sometimes
suddenly SKIP on the first invocation, ... either the instructions are
unclear or the tests are shaky. This is how far I got:

# meson test -C build --suite ndctl:dax
ninja: Entering directory `/root/ndctl/build'
[1/70] Generating version.h with a custom command
 1/13 ndctl:dax / daxdev-errors.sh          OK              15.08s
 2/13 ndctl:dax / multi-dax.sh              OK               5.80s
 3/13 ndctl:dax / sub-section.sh            SKIP             0.39s   exit status 77
 4/13 ndctl:dax / dax-dev                   OK               1.37s
 5/13 ndctl:dax / dax-ext4.sh               OK              32.70s
 6/13 ndctl:dax / dax-xfs.sh                OK              29.43s
 7/13 ndctl:dax / device-dax                OK              44.50s
 8/13 ndctl:dax / revoke-devmem             OK               0.98s
 9/13 ndctl:dax / device-dax-fio.sh         SKIP             0.10s   exit status 77
10/13 ndctl:dax / daxctl-devices.sh         SKIP             0.16s   exit status 77
11/13 ndctl:dax / daxctl-create.sh          FAIL             2.61s   exit status 1
12/13 ndctl:dax / dm.sh                     FAIL             0.23s   exit status 1
13/13 ndctl:dax / mmap.sh                   OK             437.86s

So, no idea if this series breaks something, because the tests are rather
unreliable. I have plenty of other debug settings on, maybe that's a
problem? I guess if the FS tests and mmap test pass, we're mostly good.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>


v1 -> v2:
* "mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()"
 -> Added after stumbling over that
* Modified the other tests to reuse the existing function by passing a
  new struct
* Renamed the patches to talk about "folios" instead of pages and adjusted
  the patch descriptions
* Dropped RB/TB from Dan and Oscar due to the changes

David Hildenbrand (3):
  mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
  mm/huge_memory: don't mark refcounted folios special in
    vmf_insert_folio_pmd()
  mm/huge_memory: don't mark refcounted folios special in
    vmf_insert_folio_pud()

 include/linux/mm.h |  19 +++++++-
 mm/huge_memory.c   | 110 +++++++++++++++++++++++++++------------------
 2 files changed, 85 insertions(+), 44 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
  2025-06-11 12:06 [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes David Hildenbrand
@ 2025-06-11 12:06 ` David Hildenbrand
  2025-06-12  1:56   ` Alistair Popple
                     ` (3 more replies)
  2025-06-11 12:06 ` [PATCH v2 2/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd() David Hildenbrand
                   ` (4 subsequent siblings)
  5 siblings, 4 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-06-11 12:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, nvdimm, linux-cxl, David Hildenbrand, Andrew Morton,
	Alistair Popple, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Dan Williams, Oscar Salvador, stable

We set up the cache mode but ... don't forward the updated pgprot to
insert_pfn_pud().

Only a problem on x86-64 PAT when mapping PFNs using PUDs that
require a special cachemode.

Fix it by using the proper pgprot where the cachemode was set up.

Identified by code inspection.

Fixes: 7b806d229ef1 ("mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries")
Cc: <stable@vger.kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d3e66136e41a3..49b98082c5401 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1516,10 +1516,9 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
 }
 
 static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, pfn_t pfn, bool write)
+		pud_t *pud, pfn_t pfn, pgprot_t prot, bool write)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	pgprot_t prot = vma->vm_page_prot;
 	pud_t entry;
 
 	if (!pud_none(*pud)) {
@@ -1581,7 +1580,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
 	pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
 
 	ptl = pud_lock(vma->vm_mm, vmf->pud);
-	insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
+	insert_pfn_pud(vma, addr, vmf->pud, pfn, pgprot, write);
 	spin_unlock(ptl);
 
 	return VM_FAULT_NOPAGE;
@@ -1625,7 +1624,7 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
 		add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
 	}
 	insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)),
-		write);
+		       vma->vm_page_prot, write);
 	spin_unlock(ptl);
 
 	return VM_FAULT_NOPAGE;
-- 
2.49.0



* [PATCH v2 2/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd()
  2025-06-11 12:06 [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes David Hildenbrand
  2025-06-11 12:06 ` [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud() David Hildenbrand
@ 2025-06-11 12:06 ` David Hildenbrand
  2025-06-12  2:17   ` Alistair Popple
                     ` (3 more replies)
  2025-06-11 12:06 ` [PATCH v2 3/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud() David Hildenbrand
                   ` (3 subsequent siblings)
  5 siblings, 4 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-06-11 12:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, nvdimm, linux-cxl, David Hildenbrand, Andrew Morton,
	Alistair Popple, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Dan Williams, Oscar Salvador

Marking PMDs that map "normal" refcounted folios as special is
against our rules documented for vm_normal_page().

Fortunately, there are not that many pmd_special() checks that can be
misled, and most vm_normal_page_pmd()/vm_normal_folio_pmd() users that
would get this wrong right now are rather harmless: e.g., none so far
base the decision of whether to grab a folio reference on it.

Well, and GUP-fast will fall back to GUP-slow. All in all, so far there
seem to be no big implications.

Getting this right will get more important as we use
folio_normal_page_pmd() in more places.

Fix it by teaching insert_pfn_pmd() to properly handle folios and
pfns -- moving refcount/mapcount/etc handling in there, renaming it to
insert_pmd(), and distinguishing between both cases using a new simple
"struct folio_or_pfn" structure.

Use folio_mk_pmd() to create a pmd for a folio cleanly.

Fixes: 6c88f72691f8 ("mm/huge_memory: add vmf_insert_folio_pmd()")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 58 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 39 insertions(+), 19 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 49b98082c5401..7e3e9028873e5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1372,9 +1372,17 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	return __do_huge_pmd_anonymous_page(vmf);
 }
 
-static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
-		pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
-		pgtable_t pgtable)
+struct folio_or_pfn {
+	union {
+		struct folio *folio;
+		pfn_t pfn;
+	};
+	bool is_folio;
+};
+
+static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
+		pmd_t *pmd, struct folio_or_pfn fop, pgprot_t prot,
+		bool write, pgtable_t pgtable)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t entry;
@@ -1382,8 +1390,11 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	lockdep_assert_held(pmd_lockptr(mm, pmd));
 
 	if (!pmd_none(*pmd)) {
+		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
+					  pfn_t_to_pfn(fop.pfn);
+
 		if (write) {
-			if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
+			if (pmd_pfn(*pmd) != pfn) {
 				WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
 				return -EEXIST;
 			}
@@ -1396,11 +1407,19 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 		return -EEXIST;
 	}
 
-	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
-	if (pfn_t_devmap(pfn))
-		entry = pmd_mkdevmap(entry);
-	else
-		entry = pmd_mkspecial(entry);
+	if (fop.is_folio) {
+		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
+
+		folio_get(fop.folio);
+		folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
+		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
+	} else {
+		entry = pmd_mkhuge(pfn_t_pmd(fop.pfn, prot));
+		if (pfn_t_devmap(fop.pfn))
+			entry = pmd_mkdevmap(entry);
+		else
+			entry = pmd_mkspecial(entry);
+	}
 	if (write) {
 		entry = pmd_mkyoung(pmd_mkdirty(entry));
 		entry = maybe_pmd_mkwrite(entry, vma);
@@ -1431,6 +1450,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
 	unsigned long addr = vmf->address & PMD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
 	pgprot_t pgprot = vma->vm_page_prot;
+	struct folio_or_pfn fop = {
+		.pfn = pfn,
+	};
 	pgtable_t pgtable = NULL;
 	spinlock_t *ptl;
 	int error;
@@ -1458,8 +1480,8 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
 	pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
 
 	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
-	error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write,
-			pgtable);
+	error = insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write,
+			   pgtable);
 	spin_unlock(ptl);
 	if (error && pgtable)
 		pte_free(vma->vm_mm, pgtable);
@@ -1474,6 +1496,10 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
 	struct vm_area_struct *vma = vmf->vma;
 	unsigned long addr = vmf->address & PMD_MASK;
 	struct mm_struct *mm = vma->vm_mm;
+	struct folio_or_pfn fop = {
+		.folio = folio,
+		.is_folio = true,
+	};
 	spinlock_t *ptl;
 	pgtable_t pgtable = NULL;
 	int error;
@@ -1491,14 +1517,8 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
 	}
 
 	ptl = pmd_lock(mm, vmf->pmd);
-	if (pmd_none(*vmf->pmd)) {
-		folio_get(folio);
-		folio_add_file_rmap_pmd(folio, &folio->page, vma);
-		add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
-	}
-	error = insert_pfn_pmd(vma, addr, vmf->pmd,
-			pfn_to_pfn_t(folio_pfn(folio)), vma->vm_page_prot,
-			write, pgtable);
+	error = insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot,
+			   write, pgtable);
 	spin_unlock(ptl);
 	if (error && pgtable)
 		pte_free(mm, pgtable);
-- 
2.49.0



* [PATCH v2 3/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud()
  2025-06-11 12:06 [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes David Hildenbrand
  2025-06-11 12:06 ` [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud() David Hildenbrand
  2025-06-11 12:06 ` [PATCH v2 2/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd() David Hildenbrand
@ 2025-06-11 12:06 ` David Hildenbrand
  2025-06-12  4:40   ` Dan Williams
                     ` (2 more replies)
  2025-06-11 23:08 ` [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes Andrew Morton
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-06-11 12:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, nvdimm, linux-cxl, David Hildenbrand, Andrew Morton,
	Alistair Popple, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Dan Williams, Oscar Salvador

Marking PUDs that map "normal" refcounted folios as special is
against our rules documented for vm_normal_page().

Fortunately, there are not that many pud_special() checks that can be
misled, and they are right now rather harmless: e.g., none so far
base the decision of whether to grab a folio reference on it.

Well, and GUP-fast will fall back to GUP-slow. All in all, so far there
seem to be no big implications.

Getting this right will get more important as we introduce
folio_normal_page_pud() and start using it in more places where we
currently special-case based on other VMA flags.

Fix it just like we fixed vmf_insert_folio_pmd().

Add folio_mk_pud() to mimic what we do with folio_mk_pmd().

Fixes: dbe54153296d ("mm/huge_memory: add vmf_insert_folio_pud()")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h | 19 ++++++++++++++++-
 mm/huge_memory.c   | 51 +++++++++++++++++++++++++---------------------
 2 files changed, 46 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa538feaa8d95..912b6d40a12d6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1816,7 +1816,24 @@ static inline pmd_t folio_mk_pmd(struct folio *folio, pgprot_t pgprot)
 {
 	return pmd_mkhuge(pfn_pmd(folio_pfn(folio), pgprot));
 }
-#endif
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+/**
+ * folio_mk_pud - Create a PUD for this folio
+ * @folio: The folio to create a PUD for
+ * @pgprot: The page protection bits to use
+ *
+ * Create a page table entry for the first page of this folio.
+ * This is suitable for passing to set_pud_at().
+ *
+ * Return: A page table entry suitable for mapping this folio.
+ */
+static inline pud_t folio_mk_pud(struct folio *folio, pgprot_t pgprot)
+{
+	return pud_mkhuge(pfn_pud(folio_pfn(folio), pgprot));
+}
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif /* CONFIG_MMU */
 
 static inline bool folio_has_pincount(const struct folio *folio)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7e3e9028873e5..4734de1dc0ae4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1535,15 +1535,18 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
 	return pud;
 }
 
-static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, pfn_t pfn, pgprot_t prot, bool write)
+static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
+		pud_t *pud, struct folio_or_pfn fop, pgprot_t prot, bool write)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pud_t entry;
 
 	if (!pud_none(*pud)) {
+		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
+					  pfn_t_to_pfn(fop.pfn);
+
 		if (write) {
-			if (WARN_ON_ONCE(pud_pfn(*pud) != pfn_t_to_pfn(pfn)))
+			if (WARN_ON_ONCE(pud_pfn(*pud) != pfn))
 				return;
 			entry = pud_mkyoung(*pud);
 			entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
@@ -1553,11 +1556,19 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 		return;
 	}
 
-	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
-	if (pfn_t_devmap(pfn))
-		entry = pud_mkdevmap(entry);
-	else
-		entry = pud_mkspecial(entry);
+	if (fop.is_folio) {
+		entry = folio_mk_pud(fop.folio, vma->vm_page_prot);
+
+		folio_get(fop.folio);
+		folio_add_file_rmap_pud(fop.folio, &fop.folio->page, vma);
+		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PUD_NR);
+	} else {
+		entry = pud_mkhuge(pfn_t_pud(fop.pfn, prot));
+		if (pfn_t_devmap(fop.pfn))
+			entry = pud_mkdevmap(entry);
+		else
+			entry = pud_mkspecial(entry);
+	}
 	if (write) {
 		entry = pud_mkyoung(pud_mkdirty(entry));
 		entry = maybe_pud_mkwrite(entry, vma);
@@ -1581,6 +1592,9 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
 	unsigned long addr = vmf->address & PUD_MASK;
 	struct vm_area_struct *vma = vmf->vma;
 	pgprot_t pgprot = vma->vm_page_prot;
+	struct folio_or_pfn fop = {
+		.pfn = pfn,
+	};
 	spinlock_t *ptl;
 
 	/*
@@ -1600,7 +1614,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
 	pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
 
 	ptl = pud_lock(vma->vm_mm, vmf->pud);
-	insert_pfn_pud(vma, addr, vmf->pud, pfn, pgprot, write);
+	insert_pud(vma, addr, vmf->pud, fop, pgprot, write);
 	spin_unlock(ptl);
 
 	return VM_FAULT_NOPAGE;
@@ -1622,6 +1636,10 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
 	unsigned long addr = vmf->address & PUD_MASK;
 	pud_t *pud = vmf->pud;
 	struct mm_struct *mm = vma->vm_mm;
+	struct folio_or_pfn fop = {
+		.folio = folio,
+		.is_folio = true,
+	};
 	spinlock_t *ptl;
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
@@ -1631,20 +1649,7 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
 		return VM_FAULT_SIGBUS;
 
 	ptl = pud_lock(mm, pud);
-
-	/*
-	 * If there is already an entry present we assume the folio is
-	 * already mapped, hence no need to take another reference. We
-	 * still call insert_pfn_pud() though in case the mapping needs
-	 * upgrading to writeable.
-	 */
-	if (pud_none(*vmf->pud)) {
-		folio_get(folio);
-		folio_add_file_rmap_pud(folio, &folio->page, vma);
-		add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
-	}
-	insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)),
-		       vma->vm_page_prot, write);
+	insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
 	spin_unlock(ptl);
 
 	return VM_FAULT_NOPAGE;
-- 
2.49.0



* Re: [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes
  2025-06-11 12:06 [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes David Hildenbrand
                   ` (2 preceding siblings ...)
  2025-06-11 12:06 ` [PATCH v2 3/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud() David Hildenbrand
@ 2025-06-11 23:08 ` Andrew Morton
  2025-06-12  7:34   ` David Hildenbrand
  2025-06-12  2:26 ` Alistair Popple
  2025-06-12 16:19 ` Lorenzo Stoakes
  5 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2025-06-11 23:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Alistair Popple,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador

On Wed, 11 Jun 2025 14:06:51 +0200 David Hildenbrand <david@redhat.com> wrote:

> While working on improving vm_normal_page() and friends, I stumbled
> over these issues: refcounted "normal" pages must not be marked
> using pmd_special() / pud_special().

Why is this?

>
> ...
>
> I spent too much time trying to get the ndctl tests mentioned by Dan
> running (.config tweaks, memmap= setup, ... ), without getting them to
> pass even without these patches. Some SKIP, some FAIL, some sometimes
> suddenly SKIP on first invocation, ... instructions unclear or the tests
> are shaky. This is how far I got:

I won't include this in the [0/N] - it doesn't seem helpful for future
readers of the patchset.

I'll give the patchset a run in mm-new, but it feels like some more
baking is needed?

The [1/N] has cc:stable but there's nothing in there to explain this
decision.  How does the issue affect userspace?


* Re: [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
  2025-06-11 12:06 ` [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud() David Hildenbrand
@ 2025-06-12  1:56   ` Alistair Popple
  2025-06-12  6:55     ` David Hildenbrand
  2025-06-12  4:34   ` Dan Williams
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 35+ messages in thread
From: Alistair Popple @ 2025-06-12  1:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador, stable

On Wed, Jun 11, 2025 at 02:06:52PM +0200, David Hildenbrand wrote:
> We set up the cache mode but ... don't forward the updated pgprot to
> insert_pfn_pud().
> 
> Only a problem on x86-64 PAT when mapping PFNs using PUDs that
> require a special cachemode.
> 
> Fix it by using the proper pgprot where the cachemode was set up.
> 
> Identified by code inspection.
> 
> Fixes: 7b806d229ef1 ("mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/huge_memory.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d3e66136e41a3..49b98082c5401 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1516,10 +1516,9 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
>  }
>  
>  static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
> -		pud_t *pud, pfn_t pfn, bool write)
> +		pud_t *pud, pfn_t pfn, pgprot_t prot, bool write)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
> -	pgprot_t prot = vma->vm_page_prot;
>  	pud_t entry;
>  
>  	if (!pud_none(*pud)) {
> @@ -1581,7 +1580,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
>  	pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
>  
>  	ptl = pud_lock(vma->vm_mm, vmf->pud);
> -	insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
> +	insert_pfn_pud(vma, addr, vmf->pud, pfn, pgprot, write);
>  	spin_unlock(ptl);
>  
>  	return VM_FAULT_NOPAGE;
> @@ -1625,7 +1624,7 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
>  		add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
>  	}
>  	insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)),
> -		write);
> +		       vma->vm_page_prot, write);

Actually, it's not immediately obvious to me why we don't call track_pfn_insert()
and forward the pgprot here as well. Prior to me adding vmf_insert_folio_pud()
device DAX would call vmf_insert_pfn_pud(), and the intent at least seems to
have been to change pgprot for that (and we did for the PTE/PMD versions).

However, now that ZONE_DEVICE folios are refcounted normally, I switched
device DAX to using vmf_insert_folio_*(), which never changes pgprot based on x86
PAT. So I think we probably need to either add that to vmf_insert_folio_*() (or
a new variant), or make it the callers' responsibility to figure out the correct
pgprot.

>  	spin_unlock(ptl);
>  
>  	return VM_FAULT_NOPAGE;
> -- 
> 2.49.0
> 


* Re: [PATCH v2 2/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd()
  2025-06-11 12:06 ` [PATCH v2 2/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd() David Hildenbrand
@ 2025-06-12  2:17   ` Alistair Popple
  2025-06-12  7:06     ` David Hildenbrand
  2025-06-12  4:36   ` Dan Williams
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 35+ messages in thread
From: Alistair Popple @ 2025-06-12  2:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador

On Wed, Jun 11, 2025 at 02:06:53PM +0200, David Hildenbrand wrote:
> Marking PMDs that map "normal" refcounted folios as special is
> against our rules documented for vm_normal_page().
> 
> Fortunately, there are not that many pmd_special() checks that can be
> misled, and most vm_normal_page_pmd()/vm_normal_folio_pmd() users that
> would get this wrong right now are rather harmless: e.g., none so far
> base the decision of whether to grab a folio reference on it.
> 
> Well, and GUP-fast will fall back to GUP-slow. All in all, so far there
> seem to be no big implications.
> 
> Getting this right will get more important as we use
> folio_normal_page_pmd() in more places.
> 
> Fix it by teaching insert_pfn_pmd() to properly handle folios and
> pfns -- moving refcount/mapcount/etc handling in there, renaming it to
> insert_pmd(), and distinguishing between both cases using a new simple
> "struct folio_or_pfn" structure.
> 
> Use folio_mk_pmd() to create a pmd for a folio cleanly.
> 
> Fixes: 6c88f72691f8 ("mm/huge_memory: add vmf_insert_folio_pmd()")
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/huge_memory.c | 58 ++++++++++++++++++++++++++++++++----------------
>  1 file changed, 39 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 49b98082c5401..7e3e9028873e5 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1372,9 +1372,17 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  	return __do_huge_pmd_anonymous_page(vmf);
>  }
>  
> -static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> -		pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
> -		pgtable_t pgtable)
> +struct folio_or_pfn {
> +	union {
> +		struct folio *folio;
> +		pfn_t pfn;
> +	};
> +	bool is_folio;
> +};

I know it's simple, but I'm still not a fan, particularly as these types of
patterns tend to proliferate once introduced. See below for a suggestion.

> +static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
> +		pmd_t *pmd, struct folio_or_pfn fop, pgprot_t prot,
> +		bool write, pgtable_t pgtable)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	pmd_t entry;
> @@ -1382,8 +1390,11 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	lockdep_assert_held(pmd_lockptr(mm, pmd));
>  
>  	if (!pmd_none(*pmd)) {
> +		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
> +					  pfn_t_to_pfn(fop.pfn);
> +
>  		if (write) {
> -			if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
> +			if (pmd_pfn(*pmd) != pfn) {
>  				WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
>  				return -EEXIST;
>  			}
> @@ -1396,11 +1407,19 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  		return -EEXIST;
>  	}
>  
> -	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
> -	if (pfn_t_devmap(pfn))
> -		entry = pmd_mkdevmap(entry);
> -	else
> -		entry = pmd_mkspecial(entry);
> +	if (fop.is_folio) {
> +		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
> +
> +		folio_get(fop.folio);
> +		folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
> +		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
> +	} else {
> +		entry = pmd_mkhuge(pfn_t_pmd(fop.pfn, prot));
> +		if (pfn_t_devmap(fop.pfn))
> +			entry = pmd_mkdevmap(entry);
> +		else
> +			entry = pmd_mkspecial(entry);
> +	}

Could we change insert_pfn_pmd() to insert_pmd_entry() and have callers call
something like pfn_to_pmd_entry() or folio_to_pmd_entry() to create the pmd_t
entry as appropriate, which is then passed to insert_pmd_entry() to do the bits
common to both?
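
For concreteness, a rough userspace sketch of that shape (mock types and
hypothetical names -- not a proposal for the actual kernel code): the two
constructors encode the special/non-special distinction up front, and the
common helper only installs the prepared entry:

```c
/*
 * Userspace sketch (mock types, hypothetical names) of the suggested split:
 * callers build the entry via pfn_to_pmd_entry()/folio_to_pmd_entry(), and
 * insert_pmd_entry() does only the work common to both. Not kernel code.
 */
#include <stdbool.h>

typedef struct {
	unsigned long val;	/* stand-in for the mapped PFN */
	bool special;		/* stand-in for the pmd_special() bit */
} mock_pmd_t;

static mock_pmd_t pfn_to_pmd_entry(unsigned long pfn)
{
	/* Raw-PFN path: no backing folio, so the entry is special */
	return (mock_pmd_t){ .val = pfn, .special = true };
}

static mock_pmd_t folio_to_pmd_entry(unsigned long folio_pfn)
{
	/* Folio path: refcounted mapping, must NOT be marked special */
	return (mock_pmd_t){ .val = folio_pfn, .special = false };
}

static int insert_pmd_entry(mock_pmd_t *slot, mock_pmd_t entry)
{
	if (slot->val)		/* stand-in for the !pmd_none() check */
		return -17;	/* stand-in for -EEXIST */
	*slot = entry;		/* stand-in for set_pmd_at() */
	return 0;
}
```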

>  	if (write) {
>  		entry = pmd_mkyoung(pmd_mkdirty(entry));
>  		entry = maybe_pmd_mkwrite(entry, vma);
> @@ -1431,6 +1450,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>  	unsigned long addr = vmf->address & PMD_MASK;
>  	struct vm_area_struct *vma = vmf->vma;
>  	pgprot_t pgprot = vma->vm_page_prot;
> +	struct folio_or_pfn fop = {
> +		.pfn = pfn,
> +	};
>  	pgtable_t pgtable = NULL;
>  	spinlock_t *ptl;
>  	int error;
> @@ -1458,8 +1480,8 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>  	pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
>  
>  	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> -	error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write,
> -			pgtable);
> +	error = insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write,
> +			   pgtable);
>  	spin_unlock(ptl);
>  	if (error && pgtable)
>  		pte_free(vma->vm_mm, pgtable);
> @@ -1474,6 +1496,10 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
>  	struct vm_area_struct *vma = vmf->vma;
>  	unsigned long addr = vmf->address & PMD_MASK;
>  	struct mm_struct *mm = vma->vm_mm;
> +	struct folio_or_pfn fop = {
> +		.folio = folio,
> +		.is_folio = true,
> +	};
>  	spinlock_t *ptl;
>  	pgtable_t pgtable = NULL;
>  	int error;
> @@ -1491,14 +1517,8 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
>  	}
>  
>  	ptl = pmd_lock(mm, vmf->pmd);
> -	if (pmd_none(*vmf->pmd)) {
> -		folio_get(folio);
> -		folio_add_file_rmap_pmd(folio, &folio->page, vma);
> -		add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
> -	}
> -	error = insert_pfn_pmd(vma, addr, vmf->pmd,
> -			pfn_to_pfn_t(folio_pfn(folio)), vma->vm_page_prot,
> -			write, pgtable);
> +	error = insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot,
> +			   write, pgtable);
>  	spin_unlock(ptl);
>  	if (error && pgtable)
>  		pte_free(mm, pgtable);
> -- 
> 2.49.0
> 


* Re: [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes
  2025-06-11 12:06 [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes David Hildenbrand
                   ` (3 preceding siblings ...)
  2025-06-11 23:08 ` [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes Andrew Morton
@ 2025-06-12  2:26 ` Alistair Popple
  2025-06-12  4:20   ` Dan Williams
  2025-06-12 16:19 ` Lorenzo Stoakes
  5 siblings, 1 reply; 35+ messages in thread
From: Alistair Popple @ 2025-06-12  2:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador

On Wed, Jun 11, 2025 at 02:06:51PM +0200, David Hildenbrand wrote:
> This is v2 of
> 	"[PATCH v1 0/2] mm/huge_memory: don't mark refcounted pages special
> 	 in vmf_insert_folio_*()"
> Now with one additional fix, based on mm/mm-unstable.
> 
> While working on improving vm_normal_page() and friends, I stumbled
> over these issues: refcounted "normal" pages must not be marked
> using pmd_special() / pud_special().
> 
> Fortunately, so far there doesn't seem to be serious damage.
> 
> I spent too much time trying to get the ndctl tests mentioned by Dan
> running (.config tweaks, memmap= setup, ... ), without getting them to
> pass even without these patches. Some SKIP, some FAIL, some sometimes
> suddenly SKIP on first invocation, ... instructions unclear or the tests
> are shaky. This is how far I got:

FWIW I had a similar experience, although I eventually got the FAIL cases below
to pass. I forget exactly what I needed to tweak for that though :-/

> # meson test -C build --suite ndctl:dax
> ninja: Entering directory `/root/ndctl/build'
> [1/70] Generating version.h with a custom command
>  1/13 ndctl:dax / daxdev-errors.sh          OK              15.08s
>  2/13 ndctl:dax / multi-dax.sh              OK               5.80s
>  3/13 ndctl:dax / sub-section.sh            SKIP             0.39s   exit status 77
>  4/13 ndctl:dax / dax-dev                   OK               1.37s
>  5/13 ndctl:dax / dax-ext4.sh               OK              32.70s
>  6/13 ndctl:dax / dax-xfs.sh                OK              29.43s
>  7/13 ndctl:dax / device-dax                OK              44.50s
>  8/13 ndctl:dax / revoke-devmem             OK               0.98s
>  9/13 ndctl:dax / device-dax-fio.sh         SKIP             0.10s   exit status 77
> 10/13 ndctl:dax / daxctl-devices.sh         SKIP             0.16s   exit status 77
> 11/13 ndctl:dax / daxctl-create.sh          FAIL             2.61s   exit status 1
> 12/13 ndctl:dax / dm.sh                     FAIL             0.23s   exit status 1
> 13/13 ndctl:dax / mmap.sh                   OK             437.86s
> 
> So, no idea if this series breaks something, because the tests are rather
> unreliable. I have plenty of other debug settings on, maybe that's a
> problem? I guess if the FS tests and mmap test pass, we're mostly good.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> 
> 
> v1 -> v2:
> * "mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()"
>  -> Added after stumbling over that
> * Modified the other tests to reuse the existing function by passing a
>   new struct
> * Renamed the patches to talk about "folios" instead of pages and adjusted
>   the patch descriptions
> * Dropped RB/TB from Dan and Oscar due to the changes
> 
> David Hildenbrand (3):
>   mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
>   mm/huge_memory: don't mark refcounted folios special in
>     vmf_insert_folio_pmd()
>   mm/huge_memory: don't mark refcounted folios special in
>     vmf_insert_folio_pud()
> 
>  include/linux/mm.h |  19 +++++++-
>  mm/huge_memory.c   | 110 +++++++++++++++++++++++++++------------------
>  2 files changed, 85 insertions(+), 44 deletions(-)
> 
> -- 
> 2.49.0
> 


* Re: [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes
  2025-06-12  2:26 ` Alistair Popple
@ 2025-06-12  4:20   ` Dan Williams
  2025-06-12  7:18     ` David Hildenbrand
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2025-06-12  4:20 UTC (permalink / raw)
  To: Alistair Popple, David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador,
	marc.herbert

Alistair Popple wrote:
> On Wed, Jun 11, 2025 at 02:06:51PM +0200, David Hildenbrand wrote:
> > This is v2 of
> > 	"[PATCH v1 0/2] mm/huge_memory: don't mark refcounted pages special
> > 	 in vmf_insert_folio_*()"
> > Now with one additional fix, based on mm/mm-unstable.
> > 
> > While working on improving vm_normal_page() and friends, I stumbled
> > over this issues: refcounted "normal" pages must not be marked
> > using pmd_special() / pud_special().
> > 
> > Fortunately, so far there doesn't seem to be serious damage.
> > 
> > I spent too much time trying to get the ndctl tests mentioned by Dan
> > running (.config tweaks, memmap= setup, ... ), without getting them to
> > pass even without these patches. Some SKIP, some FAIL, some sometimes
> > suddenly SKIP on first invocation, ... instructions unclear or the tests
> > are shaky. This is how far I got:
> 
> FWIW I had a similar experience, although I eventually got the FAIL cases below
> to pass. I forget exactly what I needed to tweak for that though :-/

Adding Marc, who has been working to clean up the documentation to solve
the reproducibility problems with standing up new environments to run
these tests.

http://lore.kernel.org/20250521002640.1700283-1-marc.herbert@linux.intel.com

There is also the run_qemu project that automates building an environment for this.

https://github.com/pmem/run_qemu

...but comes with its own set of quirks.

I have the following fixups applied to my environment to get this going on
Fedora 42 with v6.16-rc1:

diff --git a/README.md b/README.md
index 37314db7a155..8e06908d5921 100644
--- a/README.md
+++ b/README.md
@@ -84,6 +84,11 @@ loaded.  To build and install nfit_test.ko:
    CONFIG_TRANSPARENT_HUGEPAGE=y
    ```
 
+1. Install the following packages, (Fedora instructions):
+   ```
+   dnf install e2fsprogs xfsprogs parted jq trace-cmd hostname fio fio-engine-dev-dax
+   ```
+
 1. Build and install the unit test enabled libnvdimm modules in the
    following order.  The unit test modules need to be in place prior to
    the `depmod` that runs during the final `modules_install`  
diff --git a/test/dax.sh b/test/dax.sh
index 3ffbc8079eba..98faaf0eb9b2 100755
--- a/test/dax.sh
+++ b/test/dax.sh
@@ -37,13 +37,14 @@ run_test() {
 	rc=1
 	while read -r p; do
 		[[ $p ]] || continue
+		[[ $p == cpus=* ]] && continue
 		if [ "$count" -lt 10 ]; then
 			if [ "$p" != "0x100" ] && [ "$p" != "NOPAGE" ]; then
 				cleanup "$1"
 			fi
 		fi
 		count=$((count + 1))
-	done < <(trace-cmd report | awk '{ print $21 }')
+	done < <(trace-cmd report | awk '{ print $NF }')
 
 	if [ $count -lt 10 ]; then
 		cleanup "$1"

In the meantime, do not hesitate to ask me to run these tests.

FWIW with these patches on top of -rc1 I get:

---

[root@host ndctl]# meson test -C build --suite ndctl:dax
ninja: Entering directory `/root/git/ndctl/build'
[168/168] Linking target ndctl/ndctl
 1/13 ndctl:dax / daxdev-errors.sh          OK              12.60s
 2/13 ndctl:dax / multi-dax.sh              OK               2.47s
 3/13 ndctl:dax / sub-section.sh            OK               6.30s
 4/13 ndctl:dax / dax-dev                   OK               0.04s
 5/13 ndctl:dax / dax-ext4.sh               OK               3.04s
 6/13 ndctl:dax / dax-xfs.sh                OK               3.10s
 7/13 ndctl:dax / device-dax                OK               9.66s
 8/13 ndctl:dax / revoke-devmem             OK               0.22s
 9/13 ndctl:dax / device-dax-fio.sh         OK              32.32s
10/13 ndctl:dax / daxctl-devices.sh         OK               2.31s
11/13 ndctl:dax / daxctl-create.sh          SKIP             0.25s   exit status 77
12/13 ndctl:dax / dm.sh                     OK               1.00s
13/13 ndctl:dax / mmap.sh                   OK              62.27s

Ok:                12  
Fail:              0   
Skipped:           1   

Full log written to /root/git/ndctl/build/meson-logs/testlog.txt

---

Note that the daxctl-create.sh skip is a known unrelated v6.16-rc1 regression
fixed with this set:

http://lore.kernel.org/20250607033228.1475625-1-dan.j.williams@intel.com

You can add:

Tested-by: Dan Williams <dan.j.williams@intel.com>


* Re: [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
  2025-06-11 12:06 ` [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud() David Hildenbrand
  2025-06-12  1:56   ` Alistair Popple
@ 2025-06-12  4:34   ` Dan Williams
  2025-06-12  6:46     ` David Hildenbrand
  2025-06-12 15:28   ` Lorenzo Stoakes
  2025-06-12 17:59   ` Jason Gunthorpe
  3 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2025-06-12  4:34 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, nvdimm, linux-cxl, David Hildenbrand, Andrew Morton,
	Alistair Popple, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Dan Williams, Oscar Salvador, stable

David Hildenbrand wrote:
> We set up the cache mode but ... don't forward the updated pgprot to
> insert_pfn_pud().
> 
> Only a problem on x86-64 PAT when mapping PFNs using PUDs that
> require a special cachemode.

This is only a problem if the kernel mapped the pud in advance of userspace
mapping it, right?

The change looks good.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

...but I am struggling with the scenario where this causes problems in
practice, where vm_page_prot is the wrong cachemode.


* Re: [PATCH v2 2/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd()
  2025-06-11 12:06 ` [PATCH v2 2/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd() David Hildenbrand
  2025-06-12  2:17   ` Alistair Popple
@ 2025-06-12  4:36   ` Dan Williams
  2025-06-12 16:10   ` Lorenzo Stoakes
  2025-06-12 18:02   ` Jason Gunthorpe
  3 siblings, 0 replies; 35+ messages in thread
From: Dan Williams @ 2025-06-12  4:36 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, nvdimm, linux-cxl, David Hildenbrand, Andrew Morton,
	Alistair Popple, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Dan Williams, Oscar Salvador

David Hildenbrand wrote:
> Marking PMDs that map a "normal" refcounted folios as special is
> against our rules documented for vm_normal_page().
> 
> Fortunately, there are not that many pmd_special() checks that can be
> misled, and most vm_normal_page_pmd()/vm_normal_folio_pmd() users that
> would get this wrong right now are rather harmless: e.g., none so far
> base the decision whether to grab a folio reference on it.
> 
> Well, and GUP-fast will fall back to GUP-slow. All in all, so far no big
> implications as it seems.
> 
> Getting this right will get more important as we use
> folio_normal_page_pmd() in more places.
> 
> Fix it by teaching insert_pfn_pmd() to properly handle folios and
> pfns -- moving refcount/mapcount/etc handling in there, renaming it to
> insert_pmd(), and distinguishing between both cases using a new simple
> "struct folio_or_pfn" structure.
> 
> Use folio_mk_pmd() to create a pmd for a folio cleanly.

Looks good, I like copying the sockptr_t approach for this, and agree that this
seems to not cause any problems in practice today, but definitely will be a
trip hazard going forward.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>


* Re: [PATCH v2 3/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud()
  2025-06-11 12:06 ` [PATCH v2 3/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud() David Hildenbrand
@ 2025-06-12  4:40   ` Dan Williams
  2025-06-12 16:49   ` Lorenzo Stoakes
  2025-06-12 18:02   ` Jason Gunthorpe
  2 siblings, 0 replies; 35+ messages in thread
From: Dan Williams @ 2025-06-12  4:40 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, nvdimm, linux-cxl, David Hildenbrand, Andrew Morton,
	Alistair Popple, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Dan Williams, Oscar Salvador

David Hildenbrand wrote:
> Marking PUDs that map a "normal" refcounted folios as special is
> against our rules documented for vm_normal_page().
> 
> Fortunately, there are not that many pud_special() checks that can be
> misled, and they are right now rather harmless: e.g., none so far
> base the decision whether to grab a folio reference on it.
> 
> Well, and GUP-fast will fall back to GUP-slow. All in all, so far no big
> implications as it seems.
> 
> Getting this right will get more important as we introduce
> folio_normal_page_pud() and start using it in more places where we
> currently special-case based on other VMA flags.
> 
> Fix it just like we fixed vmf_insert_folio_pmd().
> 
> Add folio_mk_pud() to mimic what we do with folio_mk_pmd().
> 
> Fixes: dbe54153296d ("mm/huge_memory: add vmf_insert_folio_pud()")
> Signed-off-by: David Hildenbrand <david@redhat.com>

Looks good to me.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>


* Re: [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
  2025-06-12  4:34   ` Dan Williams
@ 2025-06-12  6:46     ` David Hildenbrand
  0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-06-12  6:46 UTC (permalink / raw)
  To: Dan Williams, linux-kernel
  Cc: linux-mm, nvdimm, linux-cxl, Andrew Morton, Alistair Popple,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Oscar Salvador, stable

On 12.06.25 06:34, Dan Williams wrote:
> David Hildenbrand wrote:
>> We set up the cache mode but ... don't forward the updated pgprot to
>> insert_pfn_pud().
>>
>> Only a problem on x86-64 PAT when mapping PFNs using PUDs that
>> require a special cachemode.
> 
> This is only a problem if the kernel mapped the pud in advance of userspace
> mapping it, right?

Good question, PAT code is confusing.

What I understood is that drivers like vfio will register the range with 
the expected cachemode, and then rely on vm_insert_* to fill out the 
cachemode for them.

Peter explained in the discussion here [1] how, e.g., vfio triggers
that early registration.

Regarding vfio, I can see that we do in vfio_pci_core_mmap() 
unconditionally:

vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);

and probably rely on us querying the actual cachemode to be used later.

vfio can map all kinds of different memory types ...

[1] https://lkml.kernel.org/r/aBDXr-Qp4z0tS50P@x1.local

> 
> The change looks good.
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 
> ...but I am struggling with the scenario where this causes problems in
> practice, where vm_page_prot is the wrong cachemode.

Yeah, it's all confusing.

But as long as we don't conclude that pfnmap_setup_cachemode_pfn() can 
be removed entirely (esp. also from pte / pmd case), this seems to be 
the right thing to do and was an accidental change in the introducing commit.

Is it actually stable material? I don't know, but possibly getting 
cachemodes wrong sounds ... bad?

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
  2025-06-12  1:56   ` Alistair Popple
@ 2025-06-12  6:55     ` David Hildenbrand
  0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-06-12  6:55 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador, stable

On 12.06.25 03:56, Alistair Popple wrote:
> On Wed, Jun 11, 2025 at 02:06:52PM +0200, David Hildenbrand wrote:
>> We set up the cache mode but ... don't forward the updated pgprot to
>> insert_pfn_pud().
>>
>> Only a problem on x86-64 PAT when mapping PFNs using PUDs that
>> require a special cachemode.
>>
>> Fix it by using the proper pgprot where the cachemode was set up.
>>
>> Identified by code inspection.
>>
>> Fixes: 7b806d229ef1 ("mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/huge_memory.c | 7 +++----
>>   1 file changed, 3 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index d3e66136e41a3..49b98082c5401 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1516,10 +1516,9 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
>>   }
>>   
>>   static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>> -		pud_t *pud, pfn_t pfn, bool write)
>> +		pud_t *pud, pfn_t pfn, pgprot_t prot, bool write)
>>   {
>>   	struct mm_struct *mm = vma->vm_mm;
>> -	pgprot_t prot = vma->vm_page_prot;
>>   	pud_t entry;
>>   
>>   	if (!pud_none(*pud)) {
>> @@ -1581,7 +1580,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
>>   	pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
>>   
>>   	ptl = pud_lock(vma->vm_mm, vmf->pud);
>> -	insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
>> +	insert_pfn_pud(vma, addr, vmf->pud, pfn, pgprot, write);
>>   	spin_unlock(ptl);
>>   
>>   	return VM_FAULT_NOPAGE;
>> @@ -1625,7 +1624,7 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
>>   		add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
>>   	}
>>   	insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)),
>> -		write);
>> +		       vma->vm_page_prot, write);
> 
> Actually it's not immediately obvious to me why we don't call track_pfn_insert()
> and forward the pgprot here as well.

(track_pfn_insert is now called pfnmap_setup_cachemode_pfn)

Prior to me adding vmf_insert_folio_pud()
> device DAX would call vmf_insert_pfn_pud(), and the intent at least seems to
> have been to change pgprot for that (and we did for the PTE/PMD versions).

It's only for PFNMAP mappings as far as I understand. I think this is 
mostly about drivers mapping actual weird stuff with weird memory types 
(e.g., vfio mapping MMIO etc.) into the page tables, i.e. memory that
does not have a struct page.

> 
> However, now that ZONE_DEVICE folios are refcounted normally, I switched
> device DAX to using vmf_insert_folio_*(), which never changes pgprot based on
> x86 PAT. So I think we probably need to either add that to vmf_insert_folio_*() or
> a new variant or make it the responsibility of callers to figure out the correct
> pgprot.

I would assume that for ZONE_DEVICE the cachemode is always simpler 
(e.g., no MMIO?)?

In any case, I would assume ZONE_DEVICE only ended up "accidentally" 
triggering it and that it didn't make a difference.

Observe that pfnmap_setup_cachemode_pfn() is only called from 
vmf_insert_pfn_*() ... well, and our ugly friend __vm_insert_mixed() 
that similarly inserts a PFN mapping.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 2/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd()
  2025-06-12  2:17   ` Alistair Popple
@ 2025-06-12  7:06     ` David Hildenbrand
  0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-06-12  7:06 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador

On 12.06.25 04:17, Alistair Popple wrote:
> On Wed, Jun 11, 2025 at 02:06:53PM +0200, David Hildenbrand wrote:
>> Marking PMDs that map a "normal" refcounted folios as special is
>> against our rules documented for vm_normal_page().
>>
>> Fortunately, there are not that many pmd_special() checks that can be
>> misled, and most vm_normal_page_pmd()/vm_normal_folio_pmd() users that
>> would get this wrong right now are rather harmless: e.g., none so far
>> base the decision whether to grab a folio reference on it.
>>
>> Well, and GUP-fast will fall back to GUP-slow. All in all, so far no big
>> implications as it seems.
>>
>> Getting this right will get more important as we use
>> folio_normal_page_pmd() in more places.
>>
>> Fix it by teaching insert_pfn_pmd() to properly handle folios and
>> pfns -- moving refcount/mapcount/etc handling in there, renaming it to
>> insert_pmd(), and distinguishing between both cases using a new simple
>> "struct folio_or_pfn" structure.
>>
>> Use folio_mk_pmd() to create a pmd for a folio cleanly.
>>
>> Fixes: 6c88f72691f8 ("mm/huge_memory: add vmf_insert_folio_pmd()")
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/huge_memory.c | 58 ++++++++++++++++++++++++++++++++----------------
>>   1 file changed, 39 insertions(+), 19 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 49b98082c5401..7e3e9028873e5 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1372,9 +1372,17 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>   	return __do_huge_pmd_anonymous_page(vmf);
>>   }
>>   
>> -static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>> -		pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
>> -		pgtable_t pgtable)
>> +struct folio_or_pfn {
>> +	union {
>> +		struct folio *folio;
>> +		pfn_t pfn;
>> +	};
>> +	bool is_folio;
>> +};
> 
> I know it's simple, but I'm still not a fan particularly as these types of
> patterns tend to proliferate once introduced. See below for a suggestion.

It's much better than abusing pfn_t for folios -- and I don't 
particularly see a problem with this pattern here as long as it stays in 
this file.

> 
>> +static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
>> +		pmd_t *pmd, struct folio_or_pfn fop, pgprot_t prot,
>> +		bool write, pgtable_t pgtable)
>>   {
>>   	struct mm_struct *mm = vma->vm_mm;
>>   	pmd_t entry;
>> @@ -1382,8 +1390,11 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>>   	lockdep_assert_held(pmd_lockptr(mm, pmd));
>>   
>>   	if (!pmd_none(*pmd)) {
>> +		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
>> +					  pfn_t_to_pfn(fop.pfn);
>> +
>>   		if (write) {
>> -			if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
>> +			if (pmd_pfn(*pmd) != pfn) {
>>   				WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
>>   				return -EEXIST;
>>   			}
>> @@ -1396,11 +1407,19 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>>   		return -EEXIST;
>>   	}
>>   
>> -	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>> -	if (pfn_t_devmap(pfn))
>> -		entry = pmd_mkdevmap(entry);
>> -	else
>> -		entry = pmd_mkspecial(entry);
>> +	if (fop.is_folio) {
>> +		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
>> +
>> +		folio_get(fop.folio);
>> +		folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
>> +		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
>> +	} else {
>> +		entry = pmd_mkhuge(pfn_t_pmd(fop.pfn, prot));
>> +		if (pfn_t_devmap(fop.pfn))
>> +			entry = pmd_mkdevmap(entry);
>> +		else
>> +			entry = pmd_mkspecial(entry);
>> +	}
> 
> Could we change insert_pfn_pmd() to insert_pmd_entry() and have callers call
> something like pfn_to_pmd_entry() or folio_to_pmd_entry() to create the pmd_t
> entry as appropriate, which is then passed to insert_pmd_entry() to do the bits
> common to both?

Yeah, I had that idea as well but discarded it, because the 
refcounting+mapcounting handling is better placed where we are actually 
inserting the pmd (not possibly only upgrading permissions of an 
existing mapping). It also avoids 4-line comments like the one we are
removing in patch #3 ...

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes
  2025-06-12  4:20   ` Dan Williams
@ 2025-06-12  7:18     ` David Hildenbrand
  2025-06-12  8:27       ` David Hildenbrand
  0 siblings, 1 reply; 35+ messages in thread
From: David Hildenbrand @ 2025-06-12  7:18 UTC (permalink / raw)
  To: Dan Williams, Alistair Popple
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Oscar Salvador, marc.herbert

On 12.06.25 06:20, Dan Williams wrote:
> Alistair Popple wrote:
>> On Wed, Jun 11, 2025 at 02:06:51PM +0200, David Hildenbrand wrote:
>>> This is v2 of
>>> 	"[PATCH v1 0/2] mm/huge_memory: don't mark refcounted pages special
>>> 	 in vmf_insert_folio_*()"
>>> Now with one additional fix, based on mm/mm-unstable.
>>>
>>> While working on improving vm_normal_page() and friends, I stumbled
>>> over these issues: refcounted "normal" pages must not be marked
>>> using pmd_special() / pud_special().
>>>
>>> Fortunately, so far there doesn't seem to be serious damage.
>>>
>>> I spent too much time trying to get the ndctl tests mentioned by Dan
>>> running (.config tweaks, memmap= setup, ... ), without getting them to
>>> pass even without these patches. Some SKIP, some FAIL, some sometimes
>>> suddenly SKIP on first invocation, ... instructions unclear or the tests
>>> are shaky. This is how far I got:
>>
>> FWIW I had a similar experience, although I eventually got the FAIL cases below
>> to pass. I forget exactly what I needed to tweak for that though :-/
> 
> Add Marc who has been working to clean the documentation up to solve the
> reproducibility problem with standing up new environments to run these
> tests.

I was about to send some doc improvements myself, but I didn't manage to 
get the tests running in the first place ... even after trying hard :)

I think there is also one issue with a test that requires you to 
actually install ndctl ... and some tests seem to temporarily fail with 
weird issues regarding "file size problems with /proc/kallsyms", 
whereas ... there are no such file size problems :)

All a bit shaky. The "memmap=" setup, which I think some of the tests 
require, is not documented anywhere for the tests. Maybe it should be 
added; not sure how big an area we actually need, though.
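For reference, emulating a pmem region for these tests is typically done
with a `memmap=<size>!<offset>` kernel parameter; the sizes and offsets
below are illustrative and must lie within the machine's usable RAM:

```shell
# Illustrative kernel command line fragment: reserve 4G of RAM starting
# at physical offset 16G as emulated persistent memory (/dev/pmem0).
GRUB_CMDLINE_LINUX="memmap=4G!16G"
```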

> 
> http://lore.kernel.org/20250521002640.1700283-1-marc.herbert@linux.intel.com
> 

I think I have CONFIG_XFS_FS=m (instead of y) and CONFIG_DAX=y (instead 
of =m), and CONFIG_NFIT_SECURITY_DEBUG not set (instead of =y).

Let me try with these settings adjusted.

> There is also the run_qemu project that automates building an environment for this.
> 
> https://github.com/pmem/run_qemu
> 
> ...but comes with its own set of quirks.
> 
> I have the following fixups applied to my environment to get this going on
> Fedora 42 with v6.16-rc1:
> 
> diff --git a/README.md b/README.md
> index 37314db7a155..8e06908d5921 100644
> --- a/README.md
> +++ b/README.md
> @@ -84,6 +84,11 @@ loaded.  To build and install nfit_test.ko:
>      CONFIG_TRANSPARENT_HUGEPAGE=y
>      ```
>   
> +1. Install the following packages, (Fedora instructions):
> +   ```
> +   dnf install e2fsprogs xfsprogs parted jq trace-cmd hostname fio fio-engine-dev-dax
> +   ```
> +
>   1. Build and install the unit test enabled libnvdimm modules in the
>      following order.  The unit test modules need to be in place prior to
>      the `depmod` that runs during the final `modules_install`
> diff --git a/test/dax.sh b/test/dax.sh
> index 3ffbc8079eba..98faaf0eb9b2 100755
> --- a/test/dax.sh
> +++ b/test/dax.sh
> @@ -37,13 +37,14 @@ run_test() {
>   	rc=1
>   	while read -r p; do
>   		[[ $p ]] || continue
> +		[[ $p == cpus=* ]] && continue
>   		if [ "$count" -lt 10 ]; then
>   			if [ "$p" != "0x100" ] && [ "$p" != "NOPAGE" ]; then
>   				cleanup "$1"
>   			fi
>   		fi
>   		count=$((count + 1))
> -	done < <(trace-cmd report | awk '{ print $21 }')
> +	done < <(trace-cmd report | awk '{ print $NF }')
>   
>   	if [ $count -lt 10 ]; then
>   		cleanup "$1"
> 
> In the meantime, do not hesitate to ask me to run these tests.

Yes, thanks, and thanks for running these tests.

> 
> FWIW with these patches on top of -rc1 I get:
> 
> ---
> 
> [root@host ndctl]# meson test -C build --suite ndctl:dax
> ninja: Entering directory `/root/git/ndctl/build'
> [168/168] Linking target ndctl/ndctl
>   1/13 ndctl:dax / daxdev-errors.sh          OK              12.60s
>   2/13 ndctl:dax / multi-dax.sh              OK               2.47s
>   3/13 ndctl:dax / sub-section.sh            OK               6.30s
>   4/13 ndctl:dax / dax-dev                   OK               0.04s
>   5/13 ndctl:dax / dax-ext4.sh               OK               3.04s
>   6/13 ndctl:dax / dax-xfs.sh                OK               3.10s
>   7/13 ndctl:dax / device-dax                OK               9.66s
>   8/13 ndctl:dax / revoke-devmem             OK               0.22s
>   9/13 ndctl:dax / device-dax-fio.sh         OK              32.32s
> 10/13 ndctl:dax / daxctl-devices.sh         OK               2.31s
> 11/13 ndctl:dax / daxctl-create.sh          SKIP             0.25s   exit status 77
> 12/13 ndctl:dax / dm.sh                     OK               1.00s
> 13/13 ndctl:dax / mmap.sh                   OK              62.27s
> 
> Ok:                12
> Fail:              0
> Skipped:           1
> 
> Full log written to /root/git/ndctl/build/meson-logs/testlog.txt
> 
> ---
> 
> Note that the daxctl-create.sh skip is a known unrelated v6.16-rc1 regression
> fixed with this set:
> 
> http://lore.kernel.org/20250607033228.1475625-1-dan.j.williams@intel.com
> 
> You can add:
> 
> Tested-by: Dan Williams <dan.j.williams@intel.com>
> 

Thanks!

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes
  2025-06-11 23:08 ` [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes Andrew Morton
@ 2025-06-12  7:34   ` David Hildenbrand
  0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-06-12  7:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Alistair Popple,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador

On 12.06.25 01:08, Andrew Morton wrote:
> On Wed, 11 Jun 2025 14:06:51 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
>> While working on improving vm_normal_page() and friends, I stumbled
>> over this issues: refcounted "normal" pages must not be marked
>> using pmd_special() / pud_special().
> 
> Why is this?

The two patches for that refer to the rules documented for 
vm_normal_page(), explain how ignoring them could mislead 
pmd_special()/pud_special() users, and note that the harm so far is 
fortunately still limited.

It's all about how we identify refcounted folios vs. pfn mappings / 
decide what's normal and what's special.

> 
>>
>> ...
>>
>> I spent too much time trying to get the ndctl tests mentioned by Dan
>> running (.config tweaks, memmap= setup, ... ), without getting them to
>> pass even without these patches. Some SKIP, some FAIL, some sometimes
>> suddenly SKIP on first invocation, ... instructions unclear or the tests
>> are shaky. This is how far I got:
> 
> I won't include this in the [0/N] - it doesn't seem helpful for future
> readers of the patchset.

Yes, trim it down to "ran ndctl tests, tests are shaky and hard to run, 
but the results indicate that the relevant stuff seems to keep working".

... combined with the Tested-by by Dan.

> 
> I'll give the patchset a run in mm-new, but it feels like some more
> baking is needed?

Fortunately Dan and Alistair managed to get the tests to run properly. So I 
don't have to waste another valuable 4 hours of my life testing some 
simple fixes that only stand between me and the actual work I want to 
get done in that area.

> 
> The [1/N] has cc:stable but there's nothing in there to explain this
> decision.  How does the issues affect userspace?

My reasoning was: Getting cachemodes in page table entries wrong sounds 
... bad? At least to me :)

PAT code is confusing (when/how could we actually mess up the 
cachemode?), so it's hard to decide when this actually hits, and what 
the exact result in each scenario would be. I tried to find out, but 
cannot spend another hour digging through that horrible code.

So if someone has a problem with "stable" here, we can drop it. But the 
fix is simple.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes
  2025-06-12  7:18     ` David Hildenbrand
@ 2025-06-12  8:27       ` David Hildenbrand
  2025-06-12 16:56         ` Marc Herbert
  0 siblings, 1 reply; 35+ messages in thread
From: David Hildenbrand @ 2025-06-12  8:27 UTC (permalink / raw)
  To: Dan Williams, Alistair Popple
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Oscar Salvador, marc.herbert

On 12.06.25 09:18, David Hildenbrand wrote:
> On 12.06.25 06:20, Dan Williams wrote:
>> Alistair Popple wrote:
>>> On Wed, Jun 11, 2025 at 02:06:51PM +0200, David Hildenbrand wrote:
>>>> This is v2 of
>>>> 	"[PATCH v1 0/2] mm/huge_memory: don't mark refcounted pages special
>>>> 	 in vmf_insert_folio_*()"
>>>> Now with one additional fix, based on mm/mm-unstable.
>>>>
>>>> While working on improving vm_normal_page() and friends, I stumbled
>>>> over these issues: refcounted "normal" pages must not be marked
>>>> using pmd_special() / pud_special().
>>>>
>>>> Fortunately, so far there doesn't seem to be serious damage.
>>>>
>>>> I spent too much time trying to get the ndctl tests mentioned by Dan
>>>> running (.config tweaks, memmap= setup, ... ), without getting them to
>>>> pass even without these patches. Some SKIP, some FAIL, some sometimes
>>>> suddenly SKIP on first invocation, ... instructions unclear or the tests
>>>> are shaky. This is how far I got:
>>>
>>> FWIW I had a similar experience, although I eventually got the FAIL cases below
>>> to pass. I forget exactly what I needed to tweak for that though :-/
>>
>> Add Marc who has been working to clean the documentation up to solve the
>> reproducibility problem with standing up new environments to run these
>> tests.
> 
> I was about to send some doc improvements myself, but I didn't manage to
> get the tests running in the first place ... even after trying hard :)
> 
> I think there is also one issue with a test that requires you to
> actually install ndctl ... and some tests seem to temporarily fail with
> weird issues regarding "file size problems with /proc/kallsyms",
> whereas ... there are no such file size problems :)
> 
> All a bit shaky. The "memmap=" stuff, which I think is required for
> some tests, is not documented anywhere. Maybe it should be added,
> though I'm not sure how big of an area we actually need.
> 
>>
>> http://lore.kernel.org/20250521002640.1700283-1-marc.herbert@linux.intel.com
>>
> 
> I think I have CONFIG_XFS_FS=m (instead of y) and CONFIG_DAX=y (instead
> of =m), and CONFIG_NFIT_SECURITY_DEBUG not set (instead of =y).
> 
> Let me try with these settings adjusted.

Yeah, no. Unfortunately that doesn't make it work with my debug config. Maybe
with the defconfig as raised by Marc it would work ... maybe I'll try that later.

# meson test -C build --suite ndctl:dax
ninja: Entering directory `/root/ndctl/build'
[1/70] Generating version.h with a custom command
  1/13 ndctl:dax / daxdev-errors.sh          OK              14.60s
  2/13 ndctl:dax / multi-dax.sh              OK               4.28s
  3/13 ndctl:dax / sub-section.sh            SKIP             0.25s   exit status 77
  4/13 ndctl:dax / dax-dev                   OK               1.00s
  5/13 ndctl:dax / dax-ext4.sh               OK              23.60s
  6/13 ndctl:dax / dax-xfs.sh                OK              23.74s
  7/13 ndctl:dax / device-dax                OK              40.61s
  8/13 ndctl:dax / revoke-devmem             OK               0.98s
  9/13 ndctl:dax / device-dax-fio.sh         SKIP             0.10s   exit status 77
10/13 ndctl:dax / daxctl-devices.sh         SKIP             0.16s   exit status 77
11/13 ndctl:dax / daxctl-create.sh          FAIL             2.53s   exit status 1
>>> DAXCTL=/root/ndctl/build/daxctl/daxctl DATA_PATH=/root/ndctl/test MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MALLOC_PERTURB_=167 LD_LIBRARY_PATH=/root/ndctl/build/cxl/lib:/root/ndctl/build/daxctl/lib:/root/ndctl/build/ndctl/lib TEST_PATH=/root/ndctl/build/test UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 NDCTL=/root/ndctl/build/ndctl/ndctl ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 /root/ndctl/test/daxctl-create.sh

12/13 ndctl:dax / dm.sh                     FAIL             0.24s   exit status 1
>>> DAXCTL=/root/ndctl/build/daxctl/daxctl DATA_PATH=/root/ndctl/test MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 LD_LIBRARY_PATH=/root/ndctl/build/cxl/lib:/root/ndctl/build/daxctl/lib:/root/ndctl/build/ndctl/lib TEST_PATH=/root/ndctl/build/test MALLOC_PERTURB_=27 NDCTL=/root/ndctl/build/ndctl/ndctl ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 /root/ndctl/test/dm.sh

13/13 ndctl:dax / mmap.sh                   OK             343.67s

Ok:                 8
Expected Fail:      0
Fail:               2
Unexpected Pass:    0
Skipped:            3
Timeout:            0

Full log written to /root/ndctl/build/meson-logs/testlog.txt


After compilation, I can see that I again have "CONFIG_DAX=y" in my config.

And for the DAX setting in "make menuconfig" I can see:

Symbol: DAX [=y]
  ...
  Selected by [y]:
  - FS_DAX [=y] && MMU [=y] && (ZONE_DEVICE [=y] || FS_DAX_LIMITED [=n]
  Selected by [m]:
  - BLK_DEV_PMEM [=m] && LIBNVDIMM [=m]

So I guess, as requested in the doc "CONFIG_FS_DAX=y" combined with
"CONFIG_DAX=m" is impossible to achieve?
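(For reference, a Kconfig "select" forces the selected symbol on whenever the selecting symbol is enabled, overriding any manual =m choice; since FS_DAX is a bool that selects DAX, FS_DAX=y pins DAX=y. Paraphrasing the relevant fs/Kconfig entry from memory - details may differ by kernel version:

```kconfig
config FS_DAX
	bool "File system based Direct Access (DAX) support"
	depends on MMU
	depends on ZONE_DEVICE || FS_DAX_LIMITED
	select DAX	# a "select" from a =y bool forces CONFIG_DAX=y
```

so CONFIG_FS_DAX=y combined with CONFIG_DAX=m indeed looks unreachable.)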


===

sub-section.sh complains about

++ /root/ndctl/build/ndctl/ndctl list -R -b ACPI.NFIT
+ json=
++ echo
++ jq -r '[.[] | select(.available_size >= 67108864)][0].dev'
+ region=
++ echo
++ jq -r '[.[] | select(.available_size >= 67108864)][0].available_size'
+ avail=
+ '[' -z ']'
+ exit 77

Not sure what the problem is in my environment. I thought we would be emulating
ACPI.NFIT.

===

device-dax-fio.sh complains about

kernel 6.16.0-rc1-00069-g0ede5baa0b46: missing fio, skipping...

So I guess I just need to install "fio" to make it fly.

Yes, with that the test is passing now.

===

daxctl-devices.sh complains about

++ reset_dev
++ /root/ndctl/build/ndctl/ndctl destroy-namespace -f -b ACPI.NFIT 'Error at line 33'
error destroying namespaces: No such device or address
destroyed 0 namespaces
++ exit 77


No idea.

===

daxctl-create.sh complains about

+ /root/ndctl/build/daxctl/daxctl reconfigure-device -m devdax -f dax1.0
libdaxctl: daxctl_dev_enable: dax1.0: failed to enable
error reconfiguring devices: Invalid argument
reconfigured 0 devices
++ cleanup 54
++ printf 'Error at line %d\n' 54
++ [[ -n dax1.0 ]]
++ reset_dax
++ test -n dax1.0
++ /root/ndctl/build/daxctl/daxctl disable-device -r 1 all
disabled 1 device
++ /root/ndctl/build/daxctl/daxctl destroy-device -r 1 all
destroyed 1 device
++ /root/ndctl/build/daxctl/daxctl reconfigure-device -s '' dax1.0
reconfigured 1 device
++ exit 1


Again, no idea ... :(


-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
  2025-06-11 12:06 ` [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud() David Hildenbrand
  2025-06-12  1:56   ` Alistair Popple
  2025-06-12  4:34   ` Dan Williams
@ 2025-06-12 15:28   ` Lorenzo Stoakes
  2025-06-12 15:36     ` David Hildenbrand
  2025-06-12 17:59   ` Jason Gunthorpe
  3 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-06-12 15:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador, stable

On Wed, Jun 11, 2025 at 02:06:52PM +0200, David Hildenbrand wrote:
> We set up the cache mode but ... don't forward the updated pgprot to
> insert_pfn_pud().
>
> Only a problem on x86-64 PAT when mapping PFNs using PUDs that
> require a special cachemode.
>
> Fix it by using the proper pgprot where the cachemode was setup.
>
> Identified by code inspection.
>
> Fixes: 7b806d229ef1 ("mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Nice catch!

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/huge_memory.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d3e66136e41a3..49b98082c5401 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1516,10 +1516,9 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
>  }
>
>  static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
> -		pud_t *pud, pfn_t pfn, bool write)
> +		pud_t *pud, pfn_t pfn, pgprot_t prot, bool write)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
> -	pgprot_t prot = vma->vm_page_prot;
>  	pud_t entry;
>
>  	if (!pud_none(*pud)) {
> @@ -1581,7 +1580,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
>  	pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
>
>  	ptl = pud_lock(vma->vm_mm, vmf->pud);
> -	insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
> +	insert_pfn_pud(vma, addr, vmf->pud, pfn, pgprot, write);
>  	spin_unlock(ptl);
>
>  	return VM_FAULT_NOPAGE;
> @@ -1625,7 +1624,7 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
>  		add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
>  	}
>  	insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)),
> -		write);
> +		       vma->vm_page_prot, write);
>  	spin_unlock(ptl);
>
>  	return VM_FAULT_NOPAGE;
> --
> 2.49.0
>


* Re: [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
  2025-06-12 15:28   ` Lorenzo Stoakes
@ 2025-06-12 15:36     ` David Hildenbrand
  2025-06-12 15:59       ` Lorenzo Stoakes
  0 siblings, 1 reply; 35+ messages in thread
From: David Hildenbrand @ 2025-06-12 15:36 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador, stable

On 12.06.25 17:28, Lorenzo Stoakes wrote:
> On Wed, Jun 11, 2025 at 02:06:52PM +0200, David Hildenbrand wrote:
>> We set up the cache mode but ... don't forward the updated pgprot to
>> insert_pfn_pud().
>>
>> Only a problem on x86-64 PAT when mapping PFNs using PUDs that
>> require a special cachemode.
>>
>> Fix it by using the proper pgprot where the cachemode was setup.
>>
>> Identified by code inspection.
>>
>> Fixes: 7b806d229ef1 ("mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Nice catch!
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks! What's your opinion on stable? Really hard to judge the impact ...

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
  2025-06-12 15:36     ` David Hildenbrand
@ 2025-06-12 15:59       ` Lorenzo Stoakes
  2025-06-12 16:00         ` David Hildenbrand
  0 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-06-12 15:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador, stable

On Thu, Jun 12, 2025 at 05:36:35PM +0200, David Hildenbrand wrote:
> On 12.06.25 17:28, Lorenzo Stoakes wrote:
> > On Wed, Jun 11, 2025 at 02:06:52PM +0200, David Hildenbrand wrote:
> > > We set up the cache mode but ... don't forward the updated pgprot to
> > > insert_pfn_pud().
> > >
> > > Only a problem on x86-64 PAT when mapping PFNs using PUDs that
> > > require a special cachemode.
> > >
> > > Fix it by using the proper pgprot where the cachemode was setup.
> > >
> > > Identified by code inspection.
> > >
> > > Fixes: 7b806d229ef1 ("mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries")

Ha! I don't even remember doing that patch... hm did I introduce this -ignoring
cache- thing? Sorry! :P

> > > Cc: <stable@vger.kernel.org>
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> >
> > Nice catch!
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Thanks! What's your opinion on stable? Really hard to judge the impact ...

I think it makes sense? This is currently incorrect so let's do the right thing
and backport.

I think as per Dan it's probably difficult to picture this causing a problem,
but on principle I think this is correct, and I don't see any harm in
backporting?

>
> --
> Cheers,
>
> David / dhildenb
>


* Re: [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
  2025-06-12 15:59       ` Lorenzo Stoakes
@ 2025-06-12 16:00         ` David Hildenbrand
  0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-06-12 16:00 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador, stable

On 12.06.25 17:59, Lorenzo Stoakes wrote:
> On Thu, Jun 12, 2025 at 05:36:35PM +0200, David Hildenbrand wrote:
>> On 12.06.25 17:28, Lorenzo Stoakes wrote:
>>> On Wed, Jun 11, 2025 at 02:06:52PM +0200, David Hildenbrand wrote:
>>>> We set up the cache mode but ... don't forward the updated pgprot to
>>>> insert_pfn_pud().
>>>>
>>>> Only a problem on x86-64 PAT when mapping PFNs using PUDs that
>>>> require a special cachemode.
>>>>
>>>> Fix it by using the proper pgprot where the cachemode was setup.
>>>>
>>>> Identified by code inspection.
>>>>
>>>> Fixes: 7b806d229ef1 ("mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries")
> 
> Ha! I don't even remember doing that patch... hm did I introduce this -ignoring
> cache- thing? Sorry! :P

:)

> 
>>>> Cc: <stable@vger.kernel.org>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>> Nice catch!
>>>
>>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>
>> Thanks! What's your opinion on stable? Really hard to judge the impact ...
> 
> I think it makes sense? This is currently incorrect so let's do the right thing
> and backport.
> 
> I think as per Dan it's probably difficult to picture this causing a problem,
> but on principle I think this is correct, and I don't see any harm in
> backporting?

Same opinion, thanks!

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 2/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd()
  2025-06-11 12:06 ` [PATCH v2 2/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd() David Hildenbrand
  2025-06-12  2:17   ` Alistair Popple
  2025-06-12  4:36   ` Dan Williams
@ 2025-06-12 16:10   ` Lorenzo Stoakes
  2025-06-13  7:44     ` David Hildenbrand
  2025-06-12 18:02   ` Jason Gunthorpe
  3 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-06-12 16:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador

On Wed, Jun 11, 2025 at 02:06:53PM +0200, David Hildenbrand wrote:
> Marking PMDs that map "normal" refcounted folios as special is
> against our rules documented for vm_normal_page().
>
> Fortunately, there are not that many pmd_special() checks that can be
> misled, and most vm_normal_page_pmd()/vm_normal_folio_pmd() users that
> would get this wrong right now are rather harmless: e.g., none so far
> bases the decision whether to grab a folio reference on that check.
>
> Well, and GUP-fast will fall back to GUP-slow. All in all, so far no big
> implications, it seems.
>
> Getting this right will get more important as we use
> folio_normal_page_pmd() in more places.
>
> Fix it by teaching insert_pfn_pmd() to properly handle folios and
> pfns -- moving refcount/mapcount/etc handling in there, renaming it to
> insert_pmd(), and distinguishing between both cases using a new simple
> "struct folio_or_pfn" structure.
>
> Use folio_mk_pmd() to create a pmd for a folio cleanly.
>
> Fixes: 6c88f72691f8 ("mm/huge_memory: add vmf_insert_folio_pmd()")
> Signed-off-by: David Hildenbrand <david@redhat.com>

Looks good to me, checked that the logic remains the same. Some micro
nits/thoughts below. So:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/huge_memory.c | 58 ++++++++++++++++++++++++++++++++----------------
>  1 file changed, 39 insertions(+), 19 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 49b98082c5401..7e3e9028873e5 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1372,9 +1372,17 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  	return __do_huge_pmd_anonymous_page(vmf);
>  }
>
> -static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> -		pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
> -		pgtable_t pgtable)
> +struct folio_or_pfn {
> +	union {
> +		struct folio *folio;
> +		pfn_t pfn;
> +	};
> +	bool is_folio;
> +};

Interesting... I guess a memdesc world will make this easy... maybe? :)

But this is a neat way of passing this.

Another mega nit is mayyybe we could have a macro for making these like:


#define DECLARE_FOP_PFN(name_, pfn_)		\
	struct folio_or_pfn name_ {		\
		.pfn = pfn_,			\
		.is_folio = false,		\
	}

#define DECLARE_FOP_FOLIO(name_, folio_)	\
	struct folio_or_pfn name_ {		\
		.folio = folio_,		\
		.is_folio = true,		\
	}

But yeah maybe overkill for this small usage in this file.

> +
> +static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
> +		pmd_t *pmd, struct folio_or_pfn fop, pgprot_t prot,
> +		bool write, pgtable_t pgtable)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	pmd_t entry;
> @@ -1382,8 +1390,11 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	lockdep_assert_held(pmd_lockptr(mm, pmd));
>
>  	if (!pmd_none(*pmd)) {
> +		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
> +					  pfn_t_to_pfn(fop.pfn);
> +
>  		if (write) {
> -			if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
> +			if (pmd_pfn(*pmd) != pfn) {
>  				WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
>  				return -EEXIST;
>  			}
> @@ -1396,11 +1407,19 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  		return -EEXIST;
>  	}
>
> -	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
> -	if (pfn_t_devmap(pfn))
> -		entry = pmd_mkdevmap(entry);
> -	else
> -		entry = pmd_mkspecial(entry);
> +	if (fop.is_folio) {
> +		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
> +
> +		folio_get(fop.folio);
> +		folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
> +		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
> +	} else {
> +		entry = pmd_mkhuge(pfn_t_pmd(fop.pfn, prot));

Mega micro annoying nit - in the above branch you have a newline after entry =,
here you don't. Maybe you should add one here also?

> +		if (pfn_t_devmap(fop.pfn))
> +			entry = pmd_mkdevmap(entry);
> +		else
> +			entry = pmd_mkspecial(entry);
> +	}
>  	if (write) {
>  		entry = pmd_mkyoung(pmd_mkdirty(entry));
>  		entry = maybe_pmd_mkwrite(entry, vma);
> @@ -1431,6 +1450,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>  	unsigned long addr = vmf->address & PMD_MASK;
>  	struct vm_area_struct *vma = vmf->vma;
>  	pgprot_t pgprot = vma->vm_page_prot;
> +	struct folio_or_pfn fop = {
> +		.pfn = pfn,
> +	};
>  	pgtable_t pgtable = NULL;
>  	spinlock_t *ptl;
>  	int error;
> @@ -1458,8 +1480,8 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>  	pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
>
>  	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> -	error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write,
> -			pgtable);
> +	error = insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write,
> +			   pgtable);
>  	spin_unlock(ptl);
>  	if (error && pgtable)
>  		pte_free(vma->vm_mm, pgtable);
> @@ -1474,6 +1496,10 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
>  	struct vm_area_struct *vma = vmf->vma;
>  	unsigned long addr = vmf->address & PMD_MASK;
>  	struct mm_struct *mm = vma->vm_mm;
> +	struct folio_or_pfn fop = {
> +		.folio = folio,
> +		.is_folio = true,
> +	};
>  	spinlock_t *ptl;
>  	pgtable_t pgtable = NULL;
>  	int error;
> @@ -1491,14 +1517,8 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
>  	}
>
>  	ptl = pmd_lock(mm, vmf->pmd);
> -	if (pmd_none(*vmf->pmd)) {
> -		folio_get(folio);
> -		folio_add_file_rmap_pmd(folio, &folio->page, vma);
> -		add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
> -	}
> -	error = insert_pfn_pmd(vma, addr, vmf->pmd,
> -			pfn_to_pfn_t(folio_pfn(folio)), vma->vm_page_prot,
> -			write, pgtable);
> +	error = insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot,
> +			   write, pgtable);
>  	spin_unlock(ptl);
>  	if (error && pgtable)
>  		pte_free(mm, pgtable);
> --
> 2.49.0
>


* Re: [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes
  2025-06-11 12:06 [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes David Hildenbrand
                   ` (4 preceding siblings ...)
  2025-06-12  2:26 ` Alistair Popple
@ 2025-06-12 16:19 ` Lorenzo Stoakes
  2025-06-12 16:22   ` David Hildenbrand
  5 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-06-12 16:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador

FWIW I did a basic build/mm selftests run locally and all is looking good!

On Wed, Jun 11, 2025 at 02:06:51PM +0200, David Hildenbrand wrote:
> This is v2 of
> 	"[PATCH v1 0/2] mm/huge_memory: don't mark refcounted pages special
> 	 in vmf_insert_folio_*()"
> Now with one additional fix, based on mm/mm-unstable.
>
> While working on improving vm_normal_page() and friends, I stumbled
> over these issues: refcounted "normal" pages must not be marked
> using pmd_special() / pud_special().
>
> Fortunately, so far there doesn't seem to be serious damage.
>
> I spent too much time trying to get the ndctl tests mentioned by Dan
> running (.config tweaks, memmap= setup, ... ), without getting them to
> pass even without these patches. Some SKIP, some FAIL, some sometimes
> suddenly SKIP on first invocation, ... instructions unclear or the tests
> are shaky. This is how far I got:
>
> # meson test -C build --suite ndctl:dax
> ninja: Entering directory `/root/ndctl/build'
> [1/70] Generating version.h with a custom command
>  1/13 ndctl:dax / daxdev-errors.sh          OK              15.08s
>  2/13 ndctl:dax / multi-dax.sh              OK               5.80s
>  3/13 ndctl:dax / sub-section.sh            SKIP             0.39s   exit status 77
>  4/13 ndctl:dax / dax-dev                   OK               1.37s
>  5/13 ndctl:dax / dax-ext4.sh               OK              32.70s
>  6/13 ndctl:dax / dax-xfs.sh                OK              29.43s
>  7/13 ndctl:dax / device-dax                OK              44.50s
>  8/13 ndctl:dax / revoke-devmem             OK               0.98s
>  9/13 ndctl:dax / device-dax-fio.sh         SKIP             0.10s   exit status 77
> 10/13 ndctl:dax / daxctl-devices.sh         SKIP             0.16s   exit status 77
> 11/13 ndctl:dax / daxctl-create.sh          FAIL             2.61s   exit status 1
> 12/13 ndctl:dax / dm.sh                     FAIL             0.23s   exit status 1
> 13/13 ndctl:dax / mmap.sh                   OK             437.86s
>
> So, no idea if this series breaks something, because the tests are rather
> unreliable. I have plenty of other debug settings on, maybe that's a
> problem? I guess if the FS tests and mmap test pass, we're mostly good.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Oscar Salvador <osalvador@suse.de>
>
>
> v1 -> v2:
> * "mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()"
>  -> Added after stumbling over that
> * Modified the other tests to reuse the existing function by passing a
>   new struct
> * Renamed the patches to talk about "folios" instead of pages and adjusted
>   the patch descriptions
> * Dropped RB/TB from Dan and Oscar due to the changes
>
> David Hildenbrand (3):
>   mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
>   mm/huge_memory: don't mark refcounted folios special in
>     vmf_insert_folio_pmd()
>   mm/huge_memory: don't mark refcounted folios special in
>     vmf_insert_folio_pud()
>
>  include/linux/mm.h |  19 +++++++-
>  mm/huge_memory.c   | 110 +++++++++++++++++++++++++++------------------
>  2 files changed, 85 insertions(+), 44 deletions(-)
>
> --
> 2.49.0
>


* Re: [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes
  2025-06-12 16:19 ` Lorenzo Stoakes
@ 2025-06-12 16:22   ` David Hildenbrand
  2025-06-12 16:30     ` Lorenzo Stoakes
  0 siblings, 1 reply; 35+ messages in thread
From: David Hildenbrand @ 2025-06-12 16:22 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador

On 12.06.25 18:19, Lorenzo Stoakes wrote:
> FWIW I did a basic build/mm self tests run locally and all looking good!

Thanks! I have another series based on this one coming up ... but I'm 
struggling to get !CONFIG_ARCH_HAS_PTE_SPECIAL tested "easily" :)

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes
  2025-06-12 16:22   ` David Hildenbrand
@ 2025-06-12 16:30     ` Lorenzo Stoakes
  0 siblings, 0 replies; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-06-12 16:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador

On Thu, Jun 12, 2025 at 06:22:32PM +0200, David Hildenbrand wrote:
> On 12.06.25 18:19, Lorenzo Stoakes wrote:
> > FWIW I did a basic build/mm self tests run locally and all looking good!
>
> Thanks! I have another series based on this series coming up ... but
> struggling to get !CONFIG_ARCH_HAS_PTE_SPECIAL tested "easily" :)

Hm which arches don't set it?

Filtering through:

arm - If !ARM_LPAE
csky
hexagon
m68k
microblaze
mips - If 32-bit or !CPU_HAS_RIXI
nios2
openrisc
um
xtensa

So the usual suspects of museum pieces and museum pieces on life-support for
some reason but also... usermode linux?

Might that be the easiest to play with?

I got this list from a basic grep for 'select ARCH_HAS_PTE_SPECIAL' so I'm not
sure if um imports some other arch's kconfig or there is some other way to set
it, but this criterion is probably accurate...

IMO: criteria for arch removal (or in case of um - adjustment :) - 32-bit
(kernel), !ARCH_HAS_PTE_SPECIAL, nommu

Of course, pipe dreams...

>
> --
> Cheers,
>
> David / dhildenb
>


* Re: [PATCH v2 3/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud()
  2025-06-11 12:06 ` [PATCH v2 3/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud() David Hildenbrand
  2025-06-12  4:40   ` Dan Williams
@ 2025-06-12 16:49   ` Lorenzo Stoakes
  2025-06-12 17:00     ` David Hildenbrand
  2025-06-12 18:02   ` Jason Gunthorpe
  2 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-06-12 16:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador

On Wed, Jun 11, 2025 at 02:06:54PM +0200, David Hildenbrand wrote:
> Marking PUDs that map "normal" refcounted folios as special is
> against our rules documented for vm_normal_page().

Might be worth referring to specifically which rule. I'm guessing it's the
general one of special == don't touch (from vm_normal_page() comment):

/*
 * vm_normal_page -- This function gets the "struct page" associated with a pte.
 *
 * "Special" mappings do not wish to be associated with a "struct page" (either
 * it doesn't exist, or it exists but they don't want to touch it). In this
 * case, NULL is returned here. "Normal" mappings do have a struct page.
 *
 * ...
 *
 */

But don't we already violate this? E.g.:

		if (vma->vm_ops && vma->vm_ops->find_special_page)
			return vma->vm_ops->find_special_page(vma, addr);

I mean this in itself perhaps means we should update this comment to say 'except
when file-backed and there is a find_special_page() hook'.

>
> Fortunately, there are not that many pud_special() checks that can be
> misled, and they are right now rather harmless: e.g., none so far
> bases its decision whether to grab a folio reference on that bit.
>
> Well, and GUP-fast will fall back to GUP-slow. All in all, no big
> implications so far, it seems.
>
> Getting this right will get more important as we introduce
> folio_normal_page_pud() and start using it in more places where we
> currently special-case based on other VMA flags.
>
> Fix it just like we fixed vmf_insert_folio_pmd().
>
> Add folio_mk_pud() to mimic what we do with folio_mk_pmd().
>
> Fixes: dbe54153296d ("mm/huge_memory: add vmf_insert_folio_pud()")
> Signed-off-by: David Hildenbrand <david@redhat.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Couple nits/comments below.

> ---
>  include/linux/mm.h | 19 ++++++++++++++++-
>  mm/huge_memory.c   | 51 +++++++++++++++++++++++++---------------------
>  2 files changed, 46 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fa538feaa8d95..912b6d40a12d6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1816,7 +1816,24 @@ static inline pmd_t folio_mk_pmd(struct folio *folio, pgprot_t pgprot)
>  {
>  	return pmd_mkhuge(pfn_pmd(folio_pfn(folio), pgprot));
>  }
> -#endif
> +
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +/**
> + * folio_mk_pud - Create a PUD for this folio
> + * @folio: The folio to create a PUD for
> + * @pgprot: The page protection bits to use
> + *
> + * Create a page table entry for the first page of this folio.
> + * This is suitable for passing to set_pud_at().
> + *
> + * Return: A page table entry suitable for mapping this folio.
> + */
> +static inline pud_t folio_mk_pud(struct folio *folio, pgprot_t pgprot)

Nice to have some consistency around pud, it seems so often we do a pmd version
of relevant functions then with pud we go 'meh whatever' :)

> +{
> +	return pud_mkhuge(pfn_pud(folio_pfn(folio), pgprot));
> +}
> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #endif /* CONFIG_MMU */
>
>  static inline bool folio_has_pincount(const struct folio *folio)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7e3e9028873e5..4734de1dc0ae4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1535,15 +1535,18 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
>  	return pud;
>  }
>
> -static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
> -		pud_t *pud, pfn_t pfn, pgprot_t prot, bool write)
> +static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
> +		pud_t *pud, struct folio_or_pfn fop, pgprot_t prot, bool write)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	pud_t entry;
>
>  	if (!pud_none(*pud)) {
> +		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
> +					  pfn_t_to_pfn(fop.pfn);
> +
>  		if (write) {
> -			if (WARN_ON_ONCE(pud_pfn(*pud) != pfn_t_to_pfn(pfn)))
> +			if (WARN_ON_ONCE(pud_pfn(*pud) != pfn))
>  				return;
>  			entry = pud_mkyoung(*pud);
>  			entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
> @@ -1553,11 +1556,19 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>  		return;
>  	}
>
> -	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
> -	if (pfn_t_devmap(pfn))
> -		entry = pud_mkdevmap(entry);
> -	else
> -		entry = pud_mkspecial(entry);
> +	if (fop.is_folio) {
> +		entry = folio_mk_pud(fop.folio, vma->vm_page_prot);
> +
> +		folio_get(fop.folio);
> +		folio_add_file_rmap_pud(fop.folio, &fop.folio->page, vma);
> +		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PUD_NR);

Nit, but might be nice to abstract for PMD/PUD.

> +	} else {
> +		entry = pud_mkhuge(pfn_t_pud(fop.pfn, prot));

Same incredibly pedantic whitespace comment from previous patch :)

> +		if (pfn_t_devmap(fop.pfn))
> +			entry = pud_mkdevmap(entry);
> +		else
> +			entry = pud_mkspecial(entry);
> +	}
>  	if (write) {
>  		entry = pud_mkyoung(pud_mkdirty(entry));
>  		entry = maybe_pud_mkwrite(entry, vma);
> @@ -1581,6 +1592,9 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
>  	unsigned long addr = vmf->address & PUD_MASK;
>  	struct vm_area_struct *vma = vmf->vma;
>  	pgprot_t pgprot = vma->vm_page_prot;
> +	struct folio_or_pfn fop = {
> +		.pfn = pfn,
> +	};
>  	spinlock_t *ptl;
>
>  	/*
> @@ -1600,7 +1614,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
>  	pfnmap_setup_cachemode_pfn(pfn_t_to_pfn(pfn), &pgprot);
>
>  	ptl = pud_lock(vma->vm_mm, vmf->pud);
> -	insert_pfn_pud(vma, addr, vmf->pud, pfn, pgprot, write);
> +	insert_pud(vma, addr, vmf->pud, fop, pgprot, write);
>  	spin_unlock(ptl);
>
>  	return VM_FAULT_NOPAGE;
> @@ -1622,6 +1636,10 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
>  	unsigned long addr = vmf->address & PUD_MASK;
>  	pud_t *pud = vmf->pud;
>  	struct mm_struct *mm = vma->vm_mm;
> +	struct folio_or_pfn fop = {
> +		.folio = folio,
> +		.is_folio = true,
> +	};
>  	spinlock_t *ptl;
>
>  	if (addr < vma->vm_start || addr >= vma->vm_end)
> @@ -1631,20 +1649,7 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
>  		return VM_FAULT_SIGBUS;
>
>  	ptl = pud_lock(mm, pud);
> -
> -	/*
> -	 * If there is already an entry present we assume the folio is
> -	 * already mapped, hence no need to take another reference. We
> -	 * still call insert_pfn_pud() though in case the mapping needs
> -	 * upgrading to writeable.
> -	 */
> -	if (pud_none(*vmf->pud)) {
> -		folio_get(folio);
> -		folio_add_file_rmap_pud(folio, &folio->page, vma);
> -		add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
> -	}
> -	insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)),
> -		       vma->vm_page_prot, write);
> +	insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
>  	spin_unlock(ptl);
>
>  	return VM_FAULT_NOPAGE;
> --
> 2.49.0
>


* Re: [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes
  2025-06-12  8:27       ` David Hildenbrand
@ 2025-06-12 16:56         ` Marc Herbert
  0 siblings, 0 replies; 35+ messages in thread
From: Marc Herbert @ 2025-06-12 16:56 UTC (permalink / raw)
  To: David Hildenbrand, Dan Williams, Alistair Popple
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Oscar Salvador



>>>>> I spent too much time trying to get the ndctl tests mentioned by Dan
>>>>> running (.config tweaks, memmap= setup, ... ), without getting them to
>>>>> pass even without these patches. Some SKIP, some FAIL, some sometimes
>>>>> suddenly SKIP on first invocation, ... instructions unclear or the tests
>>>>> are shaky. This is how far I got:
>>>>
>>>> FWIW I had a similar experience, although I eventually got the FAIL cases below
>>>> to pass. I forget exactly what I needed to tweak for that though :-/
>>>
>>> Add Marc who has been working to clean the documentation up to solve the
>>> reproducibility problem with standing up new environments to run these
>>> tests.
>>
>> I was about to send some doc improvements myself, but I didn't manage to
>> get the tests running in the first place ... even after trying hard :)
>>


>>> http://lore.kernel.org/20250521002640.1700283-1-marc.herbert@linux.intel.com
>>>
>>
>> I think I have CONFIG_XFS_FS=m (instead of y) and CONFIG_DAX=y (instead
>> of =m), and CONFIG_NFIT_SECURITY_DEBUG not set (instead of =y).
>>
>> Let me try with these settings adjusted.
> 
> Yeah, no. Unfortunately doesn't make it work with my debug config. Maybe with the
> defconfig as raised by Marc it would do ... maybe will try that later.

After a lot of trial and error to get them right, these fragments have always
worked for me:

make defconfig ARCH=x86_64
./scripts/kconfig/merge_config.sh .config ../run_qemu/.github/workflows/*.cfg

Warning: there is a CONFIG_DRM=n in there to save a lot of compilation
time.  Nothing against DRM specifically; it's just the best "value" for
a single line change :-)


The run_qemu/.github/workflows/*.cfg fragments are mostly duplicated
from ndctl.git/README.md - but unlike the latter, they're
machine-readable and testable. The CXL fragment is actually tested in 
run_qemu's CI (CI = the only way not to bitrot).
https://github.com/pmem/run_qemu/actions

As I wrote in
https://lore.kernel.org/linux-cxl/aed71134-1029-4b88-ab20-8dfa527a7438@linux.intel.com/
these fragments should ideally live in ndctl.git/, not in run_qemu.git/
(the latter could still add tweaks). Then ndctl.git/README.md could just
refer to the testable fragments instead of inlining them. "Send patches"
they say :-)
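(Aside, for anyone who hasn't used merge_config.sh: the core semantic is
simply that later fragments override earlier ones for any CONFIG_ symbol they
both set. A minimal standalone sketch of just that merging rule, with made-up
fragment contents; this is not the real script, and it ignores
"# CONFIG_FOO is not set" lines for brevity:)

```shell
#!/bin/sh
# Toy illustration of merge_config.sh semantics: for each CONFIG_
# symbol, the last fragment that sets it wins. Not the real script.
set -e
tmp=$(mktemp -d)
cat > "$tmp/base.cfg" <<'EOF'
CONFIG_DRM=y
CONFIG_DAX=m
EOF
cat > "$tmp/ci.cfg" <<'EOF'
CONFIG_DRM=n
EOF
# Keep only the last assignment seen for each symbol, in file order.
awk -F= '/^CONFIG_/ { v[$1] = $0 } END { for (k in v) print v[k] }' \
	"$tmp/base.cfg" "$tmp/ci.cfg" | sort
rm -r "$tmp"
```

(The real script additionally runs the merged result through a defconfig
target and warns when a requested value didn't stick; the sketch skips all
of that.)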



* Re: [PATCH v2 3/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud()
  2025-06-12 16:49   ` Lorenzo Stoakes
@ 2025-06-12 17:00     ` David Hildenbrand
  2025-06-12 17:08       ` Lorenzo Stoakes
  0 siblings, 1 reply; 35+ messages in thread
From: David Hildenbrand @ 2025-06-12 17:00 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador

On 12.06.25 18:49, Lorenzo Stoakes wrote:
> On Wed, Jun 11, 2025 at 02:06:54PM +0200, David Hildenbrand wrote:
>> Marking PUDs that map "normal" refcounted folios as special is
>> against our rules documented for vm_normal_page().
> 
> Might be worth referring to specifically which rule. I'm guessing it's the
> general one of special == don't touch (from vm_normal_page() comment):
> 
> /*
>   * vm_normal_page -- This function gets the "struct page" associated with a pte.
>   *
>   * "Special" mappings do not wish to be associated with a "struct page" (either
>   * it doesn't exist, or it exists but they don't want to touch it). In this
>   * case, NULL is returned here. "Normal" mappings do have a struct page.
>   *
>   * ...
>   *
>   */

Well, yes, the one vm_normal_page() is all about ... ? :)

> 
> But don't we already violate this E.g.:
> 
> 		if (vma->vm_ops && vma->vm_ops->find_special_page)
> 			return vma->vm_ops->find_special_page(vma, addr);
> > I mean this in itself perhaps means we should update this comment to say
> > 'except when file-backed and there is a find_special_page() hook'.

I rather hope we severely break this case such that we can remove that hack.

Read as in: I couldn't care less about this XEN hack, in particular, not 
documenting it.

I was already wondering about hiding it behind a XEN config so not each 
and every sane user of this function has to perform this crappy-hack check.

[...]

>>   	}
>>
>> -	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
>> -	if (pfn_t_devmap(pfn))
>> -		entry = pud_mkdevmap(entry);
>> -	else
>> -		entry = pud_mkspecial(entry);
>> +	if (fop.is_folio) {
>> +		entry = folio_mk_pud(fop.folio, vma->vm_page_prot);
>> +
>> +		folio_get(fop.folio);
>> +		folio_add_file_rmap_pud(fop.folio, &fop.folio->page, vma);
>> +		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PUD_NR);
> 
> Nit, but might be nice to abstract for PMD/PUD.

Which part exactly? Likely a follow-up if it should be abstracted.

> 
>> +	} else {
>> +		entry = pud_mkhuge(pfn_t_pud(fop.pfn, prot));
> 
> Same incredibly pedantic whitespace comment from previous patch :)

;)


-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 3/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud()
  2025-06-12 17:00     ` David Hildenbrand
@ 2025-06-12 17:08       ` Lorenzo Stoakes
  2025-06-12 17:41         ` David Hildenbrand
  0 siblings, 1 reply; 35+ messages in thread
From: Lorenzo Stoakes @ 2025-06-12 17:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador

On Thu, Jun 12, 2025 at 07:00:01PM +0200, David Hildenbrand wrote:
> On 12.06.25 18:49, Lorenzo Stoakes wrote:
> > On Wed, Jun 11, 2025 at 02:06:54PM +0200, David Hildenbrand wrote:
> > > Marking PUDs that map "normal" refcounted folios as special is
> > > against our rules documented for vm_normal_page().
> >
> > Might be worth referring to specifically which rule. I'm guessing it's the
> > general one of special == don't touch (from vm_normal_page() comment):
> >
> > /*
> >   * vm_normal_page -- This function gets the "struct page" associated with a pte.
> >   *
> >   * "Special" mappings do not wish to be associated with a "struct page" (either
> >   * it doesn't exist, or it exists but they don't want to touch it). In this
> >   * case, NULL is returned here. "Normal" mappings do have a struct page.
> >   *
> >   * ...
> >   *
> >   */
>
> Well, yes, the one vm_normal_page() is all about ... ? :)

Lol yes to be fair that is pretty obvious...

>
> >
> > But don't we already violate this E.g.:
> >
> > 		if (vma->vm_ops && vma->vm_ops->find_special_page)
> > 			return vma->vm_ops->find_special_page(vma, addr);
> > I mean this in itself perhaps means we should update this comment to say
> > 'except when file-backed and there is a find_special_page() hook'.
>
> I rather hope we severely break this case such that we can remove that hack.
>
> Read as in: I couldn't care less about this XEN hack, in particular, not
> documenting it.
>
> I was already wondering about hiding it behind a XEN config so not each and
> every sane user of this function has to perform this crappy-hack check.

Yeah, I'm not a fan of generalised hooks if they can be avoided, especially ones
where you pass critical data structures like VMAs.

It means you can, in theory, make no assumptions about what the caller does and
yeah.

To do this for such a stupid edge case is ridiculous.

>
> [...]
>
> > >   	}
> > >
> > > -	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
> > > -	if (pfn_t_devmap(pfn))
> > > -		entry = pud_mkdevmap(entry);
> > > -	else
> > > -		entry = pud_mkspecial(entry);
> > > +	if (fop.is_folio) {
> > > +		entry = folio_mk_pud(fop.folio, vma->vm_page_prot);
> > > +
> > > +		folio_get(fop.folio);
> > > +		folio_add_file_rmap_pud(fop.folio, &fop.folio->page, vma);
> > > +		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PUD_NR);
> >
> > Nit, but might be nice to abstract for PMD/PUD.
>
> Which part exactly? Likely a follow-up if it should be abstracted.

Ah on second thoughts it doesn't matter, because you're using pud variants of
everything such that it wouldn't be worth it.

Disregard this ;)

>
> >
> > > +	} else {
> > > +		entry = pud_mkhuge(pfn_t_pud(fop.pfn, prot));
> >
> > Same incredibly pedantic whitespace comment from previous patch :)
>
> ;)
>
>
> --
> Cheers,
>
> David / dhildenb
>


* Re: [PATCH v2 3/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud()
  2025-06-12 17:08       ` Lorenzo Stoakes
@ 2025-06-12 17:41         ` David Hildenbrand
  0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-06-12 17:41 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador

On 12.06.25 19:08, Lorenzo Stoakes wrote:
> On Thu, Jun 12, 2025 at 07:00:01PM +0200, David Hildenbrand wrote:
>> On 12.06.25 18:49, Lorenzo Stoakes wrote:
>>> On Wed, Jun 11, 2025 at 02:06:54PM +0200, David Hildenbrand wrote:
>>>> Marking PUDs that map "normal" refcounted folios as special is
>>>> against our rules documented for vm_normal_page().
>>>
>>> Might be worth referring to specifically which rule. I'm guessing it's the
>>> general one of special == don't touch (from vm_normal_page() comment):
>>>
>>> /*
>>>    * vm_normal_page -- This function gets the "struct page" associated with a pte.
>>>    *
>>>    * "Special" mappings do not wish to be associated with a "struct page" (either
>>>    * it doesn't exist, or it exists but they don't want to touch it). In this
>>>    * case, NULL is returned here. "Normal" mappings do have a struct page.
>>>    *
>>>    * ...
>>>    *
>>>    */
>>
>> Well, yes, the one vm_normal_page() is all about ... ? :)
> 
> Lol yes to be fair that is pretty obvious...
> 
>>
>>>
>>> But don't we already violate this E.g.:
>>>
>>> 		if (vma->vm_ops && vma->vm_ops->find_special_page)
>>> 			return vma->vm_ops->find_special_page(vma, addr);
>>> I mean this in itself perhaps means we should update this comment to say
>>> 'except when file-backed and there is a find_special_page() hook'.
>>
>> I rather hope we severely break this case such that we can remove that hack.
>>
>> Read as in: I couldn't care less about this XEN hack, in particular, not
>> documenting it.
>>
>> I was already wondering about hiding it behind a XEN config so not each and
>> every sane user of this function has to perform this crappy-hack check.
> 
> Yeah, I'm not a fan of generalised hooks if they can be avoided, especially ones
> where you pass critical data structures like VMAs.
> 
> It means you can, in theory, make no assumptions about what the caller does and
> yeah.
> 
> To do this for such a stupid edge case is ridiculous.

Also, I am not sure if this works at all as intended. I want to look 
into cleaning that up ...

When we inserted the page, we sure must have taken a reference, but when 
we inserted it we set pte_special() and ... didn't take a reference? Hmmmm

> 
>>
>> [...]
>>
>>>>    	}
>>>>
>>>> -	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
>>>> -	if (pfn_t_devmap(pfn))
>>>> -		entry = pud_mkdevmap(entry);
>>>> -	else
>>>> -		entry = pud_mkspecial(entry);
>>>> +	if (fop.is_folio) {
>>>> +		entry = folio_mk_pud(fop.folio, vma->vm_page_prot);
>>>> +
>>>> +		folio_get(fop.folio);
>>>> +		folio_add_file_rmap_pud(fop.folio, &fop.folio->page, vma);
>>>> +		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PUD_NR);
>>>
>>> Nit, but might be nice to abstract for PMD/PUD.
>>
>> Which part exactly? Likely a follow-up if it should be abstracted.
> 
> Ah on second thoughts it doesn't matter, because you're using pud variants of
> everything such that it wouldn't be worth it.
> 
> Disregard this ;)

Ah, I was already suspecting that you might have missed the sneaky _pud :)

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()
  2025-06-11 12:06 ` [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud() David Hildenbrand
                     ` (2 preceding siblings ...)
  2025-06-12 15:28   ` Lorenzo Stoakes
@ 2025-06-12 17:59   ` Jason Gunthorpe
  3 siblings, 0 replies; 35+ messages in thread
From: Jason Gunthorpe @ 2025-06-12 17:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Dan Williams, Oscar Salvador, stable

On Wed, Jun 11, 2025 at 02:06:52PM +0200, David Hildenbrand wrote:
> We set up the cache mode but ... don't forward the updated pgprot to
> insert_pfn_pud().
> 
> Only a problem on x86-64 PAT when mapping PFNs using PUDs that
> require a special cachemode.
> 
> Fix it by using the proper pgprot where the cachemode was setup.
> 
> Identified by code inspection.
> 
> Fixes: 7b806d229ef1 ("mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/huge_memory.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v2 2/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd()
  2025-06-11 12:06 ` [PATCH v2 2/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd() David Hildenbrand
                     ` (2 preceding siblings ...)
  2025-06-12 16:10   ` Lorenzo Stoakes
@ 2025-06-12 18:02   ` Jason Gunthorpe
  3 siblings, 0 replies; 35+ messages in thread
From: Jason Gunthorpe @ 2025-06-12 18:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Dan Williams, Oscar Salvador

On Wed, Jun 11, 2025 at 02:06:53PM +0200, David Hildenbrand wrote:
> Marking PMDs that map "normal" refcounted folios as special is
> against our rules documented for vm_normal_page().
> 
> Fortunately, there are not that many pmd_special() checks that can be
> misled, and most vm_normal_page_pmd()/vm_normal_folio_pmd() users that
> would get this wrong right now are rather harmless: e.g., none so far
> bases its decision whether to grab a folio reference on that bit.
> 
> Well, and GUP-fast will fall back to GUP-slow. All in all, no big
> implications so far, it seems.
> 
> Getting this right will get more important as we use
> folio_normal_page_pmd() in more places.
> 
> Fix it by teaching insert_pfn_pmd() to properly handle folios and
> pfns -- moving refcount/mapcount/etc handling in there, renaming it to
> insert_pmd(), and distinguishing between both cases using a new simple
> "struct folio_or_pfn" structure.
> 
> Use folio_mk_pmd() to create a pmd for a folio cleanly.
> 
> Fixes: 6c88f72691f8 ("mm/huge_memory: add vmf_insert_folio_pmd()")
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/huge_memory.c | 58 ++++++++++++++++++++++++++++++++----------------
>  1 file changed, 39 insertions(+), 19 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v2 3/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud()
  2025-06-11 12:06 ` [PATCH v2 3/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud() David Hildenbrand
  2025-06-12  4:40   ` Dan Williams
  2025-06-12 16:49   ` Lorenzo Stoakes
@ 2025-06-12 18:02   ` Jason Gunthorpe
  2 siblings, 0 replies; 35+ messages in thread
From: Jason Gunthorpe @ 2025-06-12 18:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Dan Williams, Oscar Salvador

On Wed, Jun 11, 2025 at 02:06:54PM +0200, David Hildenbrand wrote:
> Marking PUDs that map "normal" refcounted folios as special is
> against our rules documented for vm_normal_page().
> 
> Fortunately, there are not that many pud_special() checks that can be
> misled, and they are right now rather harmless: e.g., none so far
> bases its decision whether to grab a folio reference on that bit.
> 
> Well, and GUP-fast will fall back to GUP-slow. All in all, no big
> implications so far, it seems.
> 
> Getting this right will get more important as we introduce
> folio_normal_page_pud() and start using it in more places where we
> currently special-case based on other VMA flags.
> 
> Fix it just like we fixed vmf_insert_folio_pmd().
> 
> Add folio_mk_pud() to mimic what we do with folio_mk_pmd().
> 
> Fixes: dbe54153296d ("mm/huge_memory: add vmf_insert_folio_pud()")
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/mm.h | 19 ++++++++++++++++-
>  mm/huge_memory.c   | 51 +++++++++++++++++++++++++---------------------
>  2 files changed, 46 insertions(+), 24 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v2 2/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd()
  2025-06-12 16:10   ` Lorenzo Stoakes
@ 2025-06-13  7:44     ` David Hildenbrand
  0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2025-06-13  7:44 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, nvdimm, linux-cxl, Andrew Morton,
	Alistair Popple, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Dan Williams, Oscar Salvador

On 12.06.25 18:10, Lorenzo Stoakes wrote:
> On Wed, Jun 11, 2025 at 02:06:53PM +0200, David Hildenbrand wrote:
>> Marking PMDs that map "normal" refcounted folios as special is
>> against our rules documented for vm_normal_page().
>>
>> Fortunately, there are not that many pmd_special() checks that can be
>> misled, and most vm_normal_page_pmd()/vm_normal_folio_pmd() users that
>> would get this wrong right now are rather harmless: e.g., none so far
>> bases its decision whether to grab a folio reference on that bit.
>>
>> Well, and GUP-fast will fall back to GUP-slow. All in all, no big
>> implications so far, it seems.
>>
>> Getting this right will get more important as we use
>> folio_normal_page_pmd() in more places.
>>
>> Fix it by teaching insert_pfn_pmd() to properly handle folios and
>> pfns -- moving refcount/mapcount/etc handling in there, renaming it to
>> insert_pmd(), and distinguishing between both cases using a new simple
>> "struct folio_or_pfn" structure.
>>
>> Use folio_mk_pmd() to create a pmd for a folio cleanly.
>>
>> Fixes: 6c88f72691f8 ("mm/huge_memory: add vmf_insert_folio_pmd()")
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Looks good to me, checked that the logic remains the same. Some micro
> nits/thoughts below. So:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks!

> 
>> ---
>>   mm/huge_memory.c | 58 ++++++++++++++++++++++++++++++++----------------
>>   1 file changed, 39 insertions(+), 19 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 49b98082c5401..7e3e9028873e5 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1372,9 +1372,17 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>   	return __do_huge_pmd_anonymous_page(vmf);
>>   }
>>
>> -static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>> -		pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
>> -		pgtable_t pgtable)
>> +struct folio_or_pfn {
>> +	union {
>> +		struct folio *folio;
>> +		pfn_t pfn;
>> +	};
>> +	bool is_folio;
>> +};
> 
> Interesting... I guess a memdesc world will make this easy... maybe? :)
> 
> But this is a neat way of passing this.
> 
> Another mega nit is mayyybe we could have a macro for making these like:
> 
> 
> #define DECLARE_FOP_PFN(name_, pfn_)		\
> 	struct folio_or_pfn name_ {		\
> 		.pfn = pfn_,			\
> 		.is_folio = false,		\
> 	}
> 
> #define DECLARE_FOP_FOLIO(name_, folio_)	\
> 	struct folio_or_pfn name_ {		\
> 		.folio = folio_,		\
> 		.is_folio = true,		\
> 	}
> 
> But yeah maybe overkill for this small usage in this file.

Yeah. I suspect at some point we will convert this into a folio+idx 
("page") or "pfn" approach, at which point we could also use this for 
ordinary insert_pfn().

(hopefully, then we can also do pfn_t -> unsigned long)

So let's defer adding that for now.

> 
>> +
>> +static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
>> +		pmd_t *pmd, struct folio_or_pfn fop, pgprot_t prot,
>> +		bool write, pgtable_t pgtable)
>>   {
>>   	struct mm_struct *mm = vma->vm_mm;
>>   	pmd_t entry;
>> @@ -1382,8 +1390,11 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>>   	lockdep_assert_held(pmd_lockptr(mm, pmd));
>>
>>   	if (!pmd_none(*pmd)) {
>> +		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
>> +					  pfn_t_to_pfn(fop.pfn);
>> +
>>   		if (write) {
>> -			if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
>> +			if (pmd_pfn(*pmd) != pfn) {
>>   				WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
>>   				return -EEXIST;
>>   			}
>> @@ -1396,11 +1407,19 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>>   		return -EEXIST;
>>   	}
>>
>> -	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>> -	if (pfn_t_devmap(pfn))
>> -		entry = pmd_mkdevmap(entry);
>> -	else
>> -		entry = pmd_mkspecial(entry);
>> +	if (fop.is_folio) {
>> +		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
>> +
>> +		folio_get(fop.folio);
>> +		folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
>> +		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
>> +	} else {
>> +		entry = pmd_mkhuge(pfn_t_pmd(fop.pfn, prot));
> 
> Mega micro annoying nit - in above branch you have a newline after entry =, here
> you don't. Maybe should add here also?

Well, it's combining all the "entry" setup in one block. But I don't 
particularly care, so I'll just do it :)

-- 
Cheers,

David / dhildenb



end of thread, other threads:[~2025-06-13  7:44 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-11 12:06 [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes David Hildenbrand
2025-06-11 12:06 ` [PATCH v2 1/3] mm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud() David Hildenbrand
2025-06-12  1:56   ` Alistair Popple
2025-06-12  6:55     ` David Hildenbrand
2025-06-12  4:34   ` Dan Williams
2025-06-12  6:46     ` David Hildenbrand
2025-06-12 15:28   ` Lorenzo Stoakes
2025-06-12 15:36     ` David Hildenbrand
2025-06-12 15:59       ` Lorenzo Stoakes
2025-06-12 16:00         ` David Hildenbrand
2025-06-12 17:59   ` Jason Gunthorpe
2025-06-11 12:06 ` [PATCH v2 2/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pmd() David Hildenbrand
2025-06-12  2:17   ` Alistair Popple
2025-06-12  7:06     ` David Hildenbrand
2025-06-12  4:36   ` Dan Williams
2025-06-12 16:10   ` Lorenzo Stoakes
2025-06-13  7:44     ` David Hildenbrand
2025-06-12 18:02   ` Jason Gunthorpe
2025-06-11 12:06 ` [PATCH v2 3/3] mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud() David Hildenbrand
2025-06-12  4:40   ` Dan Williams
2025-06-12 16:49   ` Lorenzo Stoakes
2025-06-12 17:00     ` David Hildenbrand
2025-06-12 17:08       ` Lorenzo Stoakes
2025-06-12 17:41         ` David Hildenbrand
2025-06-12 18:02   ` Jason Gunthorpe
2025-06-11 23:08 ` [PATCH v2 0/3] mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes Andrew Morton
2025-06-12  7:34   ` David Hildenbrand
2025-06-12  2:26 ` Alistair Popple
2025-06-12  4:20   ` Dan Williams
2025-06-12  7:18     ` David Hildenbrand
2025-06-12  8:27       ` David Hildenbrand
2025-06-12 16:56         ` Marc Herbert
2025-06-12 16:19 ` Lorenzo Stoakes
2025-06-12 16:22   ` David Hildenbrand
2025-06-12 16:30     ` Lorenzo Stoakes
