linux-mm.kvack.org archive mirror
* [PATCH v3 0/8] mm/mprotect: Fix dax puds
@ 2024-07-15 19:21 Peter Xu
  2024-07-15 19:21 ` [PATCH v3 1/8] mm/dax: Dump start address in fault handler Peter Xu
                   ` (9 more replies)
  0 siblings, 10 replies; 25+ messages in thread
From: Peter Xu @ 2024-07-15 19:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, peterx,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar

[Based on mm-unstable, commit 31334cf98dbd, July 2nd]

v3:
- Fix a build issue on i386 PAE config
- Moved one line from patch 8 to patch 3

v1: https://lore.kernel.org/r/20240621142504.1940209-1-peterx@redhat.com
v2: https://lore.kernel.org/r/20240703212918.2417843-1-peterx@redhat.com

Dax has supported pud pages for a while, but mprotect() on puds has been
missing since the start.  This series fixes that by providing pud handling
in mprotect().  The longer-term goal is to add more types of pud mappings,
like hugetlb or pfnmaps; this series paves the way by fixing the known pud
entries first.

Considering that nobody reported this until I looked at those other types
of pud mappings, I don't think it needs to be a fix for stable, and it may
not need to be backported.  I would guess whoever cares about mprotect()
doesn't care about 1G dax puds yet, and vice versa.  I hope fixing this in
new kernels only is fine, but I'm open to suggestions.

A few small things are changed to teach mprotect() to work on PUDs.  E.g.
it needs to start with dropping NUMA_HUGE_PTE_UPDATES, which stops making
sense once there can be more than one type of huge pte.  OTOH, we also
need to push the mmu notifiers from the pmd to the pud layer, which might
need some attention, but so far I think it's safe.  For such details,
please refer to each patch's commit message.

The mprotect() pud processing should be straightforward, as I kept it as
simple as possible.  NUMA is not handled, as dax simply doesn't support it.
There's also no userfault involvement, as file memory (even when working
with userfault-wp async mode) needs to split a pud first, so a pud entry
doesn't yet need to know about userfault's existence (hugetlb entries will,
but that's for later too).

Tests
=====

What I did test:

- cross-build tests that I normally cover [1]

- smoke tested on x86_64 with the simplest program [2] doing mprotect() on
  a dev_dax 1G PUD, using QEMU's nvdimm emulation [3] and ndctl to create
  namespaces with proper alignments.  This used to throw "bad pud" but now
  runs through fine (a minimal sketch of such a program is included at the
  end of this section).  I also checked that sigbus happens on illegal
  access to protected puds.

- vmtests.

What I didn't test:

- fsdax: I wanted to give it a shot too, but only then noticed that it
  doesn't seem to be supported (according to dax_iomap_fault(), which
  always falls back on PUD_ORDER).  I do remember it being supported
  before; I could be missing something important there.. please shoot if
  so.

- userfault wp-async: I also wanted to test that userfault-wp async can
  split huge puds (here it's simply a clear_pud.. though), but it won't
  work for devdax anyway, because faults smaller than 1G aren't allowed in
  this case.  So skipped too.

- Power, as I have no hardware on hand.
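
For reference, below is a minimal sketch of the smoke-test program.  It is
not the exact program from [2]; the device path /dev/dax0.0 and the 1G
size are assumptions and need to match the ndctl-created namespace.

/* mprotect() on a 1G dev_dax PUD mapping; this used to splat "bad pud". */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE	(1UL << 30)	/* 1G, so dev_dax can install a PUD */

int main(void)
{
	void *addr;
	int fd = open("/dev/dax0.0", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Fault in the mapping so the 1G PUD gets populated. */
	memset(addr, 0, 4096);
	/* This used to hit "bad pud" in dmesg; now it should succeed. */
	if (mprotect(addr, MAP_SIZE, PROT_READ)) {
		perror("mprotect");
		return 1;
	}
	printf("mprotect() on the 1G dev_dax mapping succeeded\n");
	munmap(addr, MAP_SIZE);
	close(fd);
	return 0;
}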

Thanks,

[1] https://gitlab.com/peterx/lkb-harness/-/blob/main/config.json
[2] https://github.com/xzpeter/clibs/blob/master/misc/dax.c
[3] https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt

Peter Xu (8):
  mm/dax: Dump start address in fault handler
  mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES
  mm/mprotect: Push mmu notifier to PUDs
  mm/powerpc: Add missing pud helpers
  mm/x86: Make pud_leaf() only cares about PSE bit
  mm/x86: arch_check_zapped_pud()
  mm/x86: Add missing pud helpers
  mm/mprotect: fix dax pud handlings

 arch/powerpc/include/asm/book3s/64/pgtable.h |  3 +
 arch/powerpc/mm/book3s64/pgtable.c           | 20 ++++++
 arch/x86/include/asm/pgtable.h               | 68 +++++++++++++++---
 arch/x86/mm/pgtable.c                        | 19 +++++
 drivers/dax/device.c                         |  6 +-
 include/linux/huge_mm.h                      | 24 +++++++
 include/linux/pgtable.h                      |  7 ++
 include/linux/vm_event_item.h                |  1 -
 mm/huge_memory.c                             | 56 ++++++++++++++-
 mm/mprotect.c                                | 74 ++++++++++++--------
 mm/vmstat.c                                  |  1 -
 11 files changed, 234 insertions(+), 45 deletions(-)

-- 
2.45.0




* [PATCH v3 1/8] mm/dax: Dump start address in fault handler
  2024-07-15 19:21 [PATCH v3 0/8] mm/mprotect: Fix dax puds Peter Xu
@ 2024-07-15 19:21 ` Peter Xu
  2024-07-31 12:04   ` David Hildenbrand
  2024-07-15 19:21 ` [PATCH v3 2/8] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES Peter Xu
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-07-15 19:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, peterx,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar

Currently the dax fault handler dumps the vma range when dynamic debugging
is enabled.  That's mostly not useful.  Dump the (aligned) address instead,
together with the order info.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/dax/device.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index eb61598247a9..714174844ca5 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -235,9 +235,9 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, unsigned int order)
 	int id;
 	struct dev_dax *dev_dax = filp->private_data;
 
-	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) order:%d\n", current->comm,
-			(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
-			vmf->vma->vm_start, vmf->vma->vm_end, order);
+	dev_dbg(&dev_dax->dev, "%s: op=%s addr=%#lx order=%d\n", current->comm,
+		(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
+		vmf->address & ~((1UL << (order + PAGE_SHIFT)) - 1), order);
 
 	id = dax_read_lock();
 	if (order == 0)
-- 
2.45.0




* [PATCH v3 2/8] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES
  2024-07-15 19:21 [PATCH v3 0/8] mm/mprotect: Fix dax puds Peter Xu
  2024-07-15 19:21 ` [PATCH v3 1/8] mm/dax: Dump start address in fault handler Peter Xu
@ 2024-07-15 19:21 ` Peter Xu
  2024-07-31 12:18   ` David Hildenbrand
  2024-07-15 19:21 ` [PATCH v3 3/8] mm/mprotect: Push mmu notifier to PUDs Peter Xu
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-07-15 19:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, peterx,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar, Alex Thorlton

In 2013, commit 72403b4a0fbd ("mm: numa: return the number of base pages
altered by protection changes") introduced the "numa_huge_pte_updates"
vmstat entry, trying to capture how many huge ptes (in reality, PMD thps at
that time) were marked by NUMA balancing.

This patch proposes to remove it, for a few reasons.

Firstly, the name is misleading.  Nowadays there is more than one way to
have a "huge pte", and that's also the major goal of this series, which
paves the way for PUD handling in the change protection code paths.

PUDs are coming not only for dax (which is already here, yet broken..), but
also for pfnmaps and hugetlb pages.  The name simply stops making sense
once PUDs start to be involved in the mprotect() world.

It also wouldn't be reasonable to bump the counter for both pmds and puds.
In short, the current accounting won't be right when PUDs come, so the
scheme was only suitable at a point in time when PUDs weren't even
possible.

Secondly, the accounting was simply not right from the start, because it
was also affected by call sites other than NUMA balancing.  mprotect() is
one, and userfaultfd-wp also leverages the change protection path to modify
pgtables.  To do it right, the code would need to check the caller, but it
never did; at least mprotect() was already there in 2013.

This gives me the impression that nobody is seriously using this field, and
with the above it can't really be used seriously anyway.

We may want to do it right if any NUMA developers would like it to exist,
but we should do that with all of the above resolved: considering PUDs and
getting the accounting correct.  That can be done on top when there's a
real need for it.

Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/vm_event_item.h | 1 -
 mm/mprotect.c                 | 8 +-------
 mm/vmstat.c                   | 1 -
 3 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 747943bc8cc2..2a3797fb6742 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -59,7 +59,6 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		OOM_KILL,
 #ifdef CONFIG_NUMA_BALANCING
 		NUMA_PTE_UPDATES,
-		NUMA_HUGE_PTE_UPDATES,
 		NUMA_HINT_FAULTS,
 		NUMA_HINT_FAULTS_LOCAL,
 		NUMA_PAGE_MIGRATE,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 222ab434da54..21172272695e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -363,7 +363,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 	pmd_t *pmd;
 	unsigned long next;
 	long pages = 0;
-	unsigned long nr_huge_updates = 0;
 	struct mmu_notifier_range range;
 
 	range.start = 0;
@@ -411,11 +410,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 				ret = change_huge_pmd(tlb, vma, pmd,
 						addr, newprot, cp_flags);
 				if (ret) {
-					if (ret == HPAGE_PMD_NR) {
+					if (ret == HPAGE_PMD_NR)
 						pages += HPAGE_PMD_NR;
-						nr_huge_updates++;
-					}
-
 					/* huge pmd was handled */
 					goto next;
 				}
@@ -435,8 +431,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 	if (range.start)
 		mmu_notifier_invalidate_range_end(&range);
 
-	if (nr_huge_updates)
-		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
 	return pages;
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 73d791d1caad..53656227f70d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1313,7 +1313,6 @@ const char * const vmstat_text[] = {
 
 #ifdef CONFIG_NUMA_BALANCING
 	"numa_pte_updates",
-	"numa_huge_pte_updates",
 	"numa_hint_faults",
 	"numa_hint_faults_local",
 	"numa_pages_migrated",
-- 
2.45.0




* [PATCH v3 3/8] mm/mprotect: Push mmu notifier to PUDs
  2024-07-15 19:21 [PATCH v3 0/8] mm/mprotect: Fix dax puds Peter Xu
  2024-07-15 19:21 ` [PATCH v3 1/8] mm/dax: Dump start address in fault handler Peter Xu
  2024-07-15 19:21 ` [PATCH v3 2/8] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES Peter Xu
@ 2024-07-15 19:21 ` Peter Xu
  2024-07-15 19:21 ` [PATCH v3 4/8] mm/powerpc: Add missing pud helpers Peter Xu
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2024-07-15 19:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, peterx,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar, kvm,
	Sean Christopherson, Paolo Bonzini, David Rientjes

mprotect() invokes mmu notifiers at the PMD level.  It has been like that
since 2014, commit a5338093bfb4 ("mm: move mmu notifier call from
change_protection to change_pmd_range").

At that time, the issue was that NUMA balancing could be applied to a huge
range of VM memory even if nothing was populated.  The notification can be
avoided in that case if no valid pmd is detected, which means either a THP
or a PTE pgtable page.

Now, to pave the way for PUD handling, this isn't enough.  We need to
generate mmu notifications properly on PUD entries too.  mprotect() is
currently broken on PUDs (e.g., one can easily trigger a kernel error with
dax 1G mappings already); this is the start of fixing it.

To fix that, this patch pushes such notifications to the PUD layer.

There is a risk of regressing the problem Rik wanted to resolve before,
but I don't think it will really happen, and I still chose this solution
for a few reasons:

  1) Consider a large VM that contains far more than GBs of memory: it's
  highly likely that the PUDs are also none.  In that case there is no
  regression.

  2) KVM has evolved a lot over the years to get rid of rmap walks, which
  might have been the major cause of the previous soft-lockup.  At least
  the TDP MMU already got rid of rmap as long as it's not nested (which
  should be the major use case, IIUC), so the TDP MMU pgtable walker will
  simply see an empty VM pgtable (e.g. EPT on x86); invalidating a fully
  empty region should in most cases be pretty fast now, compared to 2014.

  3) KVM now has explicit code paths to give way to mmu notifiers just
  like this one, e.g. in commit d02c357e5bfa ("KVM: x86/mmu: Retry fault
  before acquiring mmu_lock if mapping is changing").  That also avoids
  contention that may contribute to a soft-lockup.

  4) Sticking with the PMD layer simply doesn't work when PUDs are
  there...  We need one way or another to fix PUD mappings on mprotect().

Pushing it to the PUD level should be the safest approach as of now, e.g.
there's no sign yet of huge P4Ds coming on any known arch.

Cc: kvm@vger.kernel.org
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mprotect.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 21172272695e..2a81060b603d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -363,9 +363,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 	pmd_t *pmd;
 	unsigned long next;
 	long pages = 0;
-	struct mmu_notifier_range range;
-
-	range.start = 0;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -383,14 +380,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		if (pmd_none(*pmd))
 			goto next;
 
-		/* invoke the mmu notifier if the pmd is populated */
-		if (!range.start) {
-			mmu_notifier_range_init(&range,
-				MMU_NOTIFY_PROTECTION_VMA, 0,
-				vma->vm_mm, addr, end);
-			mmu_notifier_invalidate_range_start(&range);
-		}
-
 		_pmd = pmdp_get_lockless(pmd);
 		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
@@ -428,9 +417,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
 
-	if (range.start)
-		mmu_notifier_invalidate_range_end(&range);
-
 	return pages;
 }
 
@@ -438,22 +424,36 @@ static inline long change_pud_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
+	struct mmu_notifier_range range;
 	pud_t *pud;
 	unsigned long next;
 	long pages = 0, ret;
 
+	range.start = 0;
+
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		ret = change_prepare(vma, pud, pmd, addr, cp_flags);
-		if (ret)
-			return ret;
+		if (ret) {
+			pages = ret;
+			break;
+		}
 		if (pud_none_or_clear_bad(pud))
 			continue;
+		if (!range.start) {
+			mmu_notifier_range_init(&range,
+						MMU_NOTIFY_PROTECTION_VMA, 0,
+						vma->vm_mm, addr, end);
+			mmu_notifier_invalidate_range_start(&range);
+		}
 		pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
 					  cp_flags);
 	} while (pud++, addr = next, addr != end);
 
+	if (range.start)
+		mmu_notifier_invalidate_range_end(&range);
+
 	return pages;
 }
 
-- 
2.45.0




* [PATCH v3 4/8] mm/powerpc: Add missing pud helpers
  2024-07-15 19:21 [PATCH v3 0/8] mm/mprotect: Fix dax puds Peter Xu
                   ` (2 preceding siblings ...)
  2024-07-15 19:21 ` [PATCH v3 3/8] mm/mprotect: Push mmu notifier to PUDs Peter Xu
@ 2024-07-15 19:21 ` Peter Xu
  2024-07-15 19:21 ` [PATCH v3 5/8] mm/x86: Make pud_leaf() only cares about PSE bit Peter Xu
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2024-07-15 19:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, peterx,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar

These new helpers will be needed for pud entry updates soon.  Introduce
them, modeled after the pmd ones.  Namely:

- pudp_invalidate()
- pud_modify()

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |  3 +++
 arch/powerpc/mm/book3s64/pgtable.c           | 20 ++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 519b1743a0f4..5da92ba68a45 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1124,6 +1124,7 @@ extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
 extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot);
 extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
 extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
+extern pud_t pud_modify(pud_t pud, pgprot_t newprot);
 extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 		       pmd_t *pmdp, pmd_t pmd);
 extern void set_pud_at(struct mm_struct *mm, unsigned long addr,
@@ -1384,6 +1385,8 @@ static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm,
 #define __HAVE_ARCH_PMDP_INVALIDATE
 extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			     pmd_t *pmdp);
+extern pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+			     pud_t *pudp);
 
 #define pmd_move_must_withdraw pmd_move_must_withdraw
 struct spinlock;
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index f4d8d3c40e5c..5a4a75369043 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -176,6 +176,17 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 	return __pmd(old_pmd);
 }
 
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		      pud_t *pudp)
+{
+	unsigned long old_pud;
+
+	VM_WARN_ON_ONCE(!pud_present(*pudp));
+	old_pud = pud_hugepage_update(vma->vm_mm, address, pudp, _PAGE_PRESENT, _PAGE_INVALID);
+	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+	return __pud(old_pud);
+}
+
 pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
 				   unsigned long addr, pmd_t *pmdp, int full)
 {
@@ -259,6 +270,15 @@ pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	pmdv &= _HPAGE_CHG_MASK;
 	return pmd_set_protbits(__pmd(pmdv), newprot);
 }
+
+pud_t pud_modify(pud_t pud, pgprot_t newprot)
+{
+	unsigned long pudv;
+
+	pudv = pud_val(pud);
+	pudv &= _HPAGE_CHG_MASK;
+	return pud_set_protbits(__pud(pudv), newprot);
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /* For use by kexec, called with MMU off */
-- 
2.45.0




* [PATCH v3 5/8] mm/x86: Make pud_leaf() only cares about PSE bit
  2024-07-15 19:21 [PATCH v3 0/8] mm/mprotect: Fix dax puds Peter Xu
                   ` (3 preceding siblings ...)
  2024-07-15 19:21 ` [PATCH v3 4/8] mm/powerpc: Add missing pud helpers Peter Xu
@ 2024-07-15 19:21 ` Peter Xu
  2024-07-31 12:22   ` David Hildenbrand
  2024-07-15 19:21 ` [PATCH v3 6/8] mm/x86: arch_check_zapped_pud() Peter Xu
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-07-15 19:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, peterx,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar

An entry should be reported as a PUD leaf even if it's PROT_NONE, in which
case the PRESENT bit isn't set.  Without this I hit "bad pud" when testing
dax 1G, on zapping a PROT_NONE PUD.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 65b8e5bb902c..25fc6d809572 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1073,8 +1073,7 @@ static inline pmd_t *pud_pgtable(pud_t pud)
 #define pud_leaf pud_leaf
 static inline bool pud_leaf(pud_t pud)
 {
-	return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
-		(_PAGE_PSE | _PAGE_PRESENT);
+	return pud_val(pud) & _PAGE_PSE;
 }
 
 static inline int pud_bad(pud_t pud)
-- 
2.45.0




* [PATCH v3 6/8] mm/x86: arch_check_zapped_pud()
  2024-07-15 19:21 [PATCH v3 0/8] mm/mprotect: Fix dax puds Peter Xu
                   ` (4 preceding siblings ...)
  2024-07-15 19:21 ` [PATCH v3 5/8] mm/x86: Make pud_leaf() only cares about PSE bit Peter Xu
@ 2024-07-15 19:21 ` Peter Xu
  2024-07-31 12:23   ` David Hildenbrand
  2024-07-15 19:21 ` [PATCH v3 7/8] mm/x86: Add missing pud helpers Peter Xu
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-07-15 19:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, peterx,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar

Introduce arch_check_zapped_pud() to sanity check shadow stack on PUD
zaps.  It has the same logic as the PMD helper.

One thing worth mentioning is that it might be a good idea to use
page_table_check in the future for trapping wrong setups of shadow stack
pgtable entries [1].  That is left for the future as a separate effort.

[1] https://lore.kernel.org/all/59d518698f664e07c036a5098833d7b56b953305.camel@intel.com

Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h | 10 ++++++++++
 arch/x86/mm/pgtable.c          |  7 +++++++
 include/linux/pgtable.h        |  7 +++++++
 mm/huge_memory.c               |  4 +++-
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 25fc6d809572..cdf044c2ad6e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -169,6 +169,13 @@ static inline int pud_young(pud_t pud)
 	return pud_flags(pud) & _PAGE_ACCESSED;
 }
 
+static inline bool pud_shstk(pud_t pud)
+{
+	return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+	       (pud_flags(pud) & (_PAGE_RW | _PAGE_DIRTY | _PAGE_PSE)) ==
+	       (_PAGE_DIRTY | _PAGE_PSE);
+}
+
 static inline int pte_write(pte_t pte)
 {
 	/*
@@ -1662,6 +1669,9 @@ void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte);
 #define arch_check_zapped_pmd arch_check_zapped_pmd
 void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd);
 
+#define arch_check_zapped_pud arch_check_zapped_pud
+void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud);
+
 #ifdef CONFIG_XEN_PV
 #define arch_has_hw_nonleaf_pmd_young arch_has_hw_nonleaf_pmd_young
 static inline bool arch_has_hw_nonleaf_pmd_young(void)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 93e54ba91fbf..564b8945951e 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -926,3 +926,10 @@ void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd)
 	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
 			pmd_shstk(pmd));
 }
+
+void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud)
+{
+	/* See note in arch_check_zapped_pte() */
+	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
+			pud_shstk(pud));
+}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2a6a3cccfc36..2289e9f7aa1b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -447,6 +447,13 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
 }
 #endif
 
+#ifndef arch_check_zapped_pud
+static inline void arch_check_zapped_pud(struct vm_area_struct *vma,
+					 pud_t pud)
+{
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9fec5bd1c8b0..c10247bef08a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2291,12 +2291,14 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pud_t *pud, unsigned long addr)
 {
 	spinlock_t *ptl;
+	pud_t orig_pud;
 
 	ptl = __pud_trans_huge_lock(pud, vma);
 	if (!ptl)
 		return 0;
 
-	pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
+	orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
+	arch_check_zapped_pud(vma, orig_pud);
 	tlb_remove_pud_tlb_entry(tlb, pud, addr);
 	if (vma_is_special_huge(vma)) {
 		spin_unlock(ptl);
-- 
2.45.0




* [PATCH v3 7/8] mm/x86: Add missing pud helpers
  2024-07-15 19:21 [PATCH v3 0/8] mm/mprotect: Fix dax puds Peter Xu
                   ` (5 preceding siblings ...)
  2024-07-15 19:21 ` [PATCH v3 6/8] mm/x86: arch_check_zapped_pud() Peter Xu
@ 2024-07-15 19:21 ` Peter Xu
  2024-07-15 19:21 ` [PATCH v3 8/8] mm/mprotect: fix dax pud handlings Peter Xu
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2024-07-15 19:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, peterx,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar

These new helpers will be needed for pud entry updates soon.  Introduce
them, modeled after the pmd ones.  Namely:

- pudp_invalidate()
- pud_modify()

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h | 55 +++++++++++++++++++++++++++++-----
 arch/x86/mm/pgtable.c          | 12 ++++++++
 2 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index cdf044c2ad6e..701593c53f3b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -782,6 +782,12 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
 		      __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
 }
 
+static inline pud_t pud_mkinvalid(pud_t pud)
+{
+	return pfn_pud(pud_pfn(pud),
+		       __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
+}
+
 static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
@@ -829,14 +835,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	pmd_result = __pmd(val);
 
 	/*
-	 * To avoid creating Write=0,Dirty=1 PMDs, pte_modify() needs to avoid:
-	 *  1. Marking Write=0 PMDs Dirty=1
-	 *  2. Marking Dirty=1 PMDs Write=0
-	 *
-	 * The first case cannot happen because the _PAGE_CHG_MASK will filter
-	 * out any Dirty bit passed in newprot. Handle the second case by
-	 * going through the mksaveddirty exercise. Only do this if the old
-	 * value was Write=1 to avoid doing this on Shadow Stack PTEs.
+	 * Avoid creating shadow stack PMD by accident.  See comment in
+	 * pte_modify().
 	 */
 	if (oldval & _PAGE_RW)
 		pmd_result = pmd_mksaveddirty(pmd_result);
@@ -846,6 +846,29 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	return pmd_result;
 }
 
+static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
+{
+	pudval_t val = pud_val(pud), oldval = val;
+	pud_t pud_result;
+
+	val &= _HPAGE_CHG_MASK;
+	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+	val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);
+
+	pud_result = __pud(val);
+
+	/*
+	 * Avoid creating shadow stack PUD by accident.  See comment in
+	 * pte_modify().
+	 */
+	if (oldval & _PAGE_RW)
+		pud_result = pud_mksaveddirty(pud_result);
+	else
+		pud_result = pud_clear_saveddirty(pud_result);
+
+	return pud_result;
+}
+
 /*
  * mprotect needs to preserve PAT and encryption bits when updating
  * vm_page_prot
@@ -1384,10 +1407,26 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 }
 #endif
 
+static inline pud_t pudp_establish(struct vm_area_struct *vma,
+		unsigned long address, pud_t *pudp, pud_t pud)
+{
+	page_table_check_pud_set(vma->vm_mm, pudp, pud);
+	if (IS_ENABLED(CONFIG_SMP)) {
+		return xchg(pudp, pud);
+	} else {
+		pud_t old = *pudp;
+		WRITE_ONCE(*pudp, pud);
+		return old;
+	}
+}
+
 #define __HAVE_ARCH_PMDP_INVALIDATE_AD
 extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
 				unsigned long address, pmd_t *pmdp);
 
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		      pud_t *pudp);
+
 /*
  * Page table pages are page-aligned.  The lower half of the top
  * level is used for userspace and the top half for the kernel.
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 564b8945951e..fa77411bb266 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -641,6 +641,18 @@ pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		     pud_t *pudp)
+{
+	VM_WARN_ON_ONCE(!pud_present(*pudp));
+	pud_t old = pudp_establish(vma, address, pudp, pud_mkinvalid(*pudp));
+	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+	return old;
+}
+#endif
+
 /**
  * reserve_top_address - reserves a hole in the top of kernel address space
  * @reserve - size of hole to reserve
-- 
2.45.0




* [PATCH v3 8/8] mm/mprotect: fix dax pud handlings
  2024-07-15 19:21 [PATCH v3 0/8] mm/mprotect: Fix dax puds Peter Xu
                   ` (6 preceding siblings ...)
  2024-07-15 19:21 ` [PATCH v3 7/8] mm/x86: Add missing pud helpers Peter Xu
@ 2024-07-15 19:21 ` Peter Xu
  2024-07-25 18:29   ` James Houghton
  2024-07-15 20:00 ` [PATCH v3 0/8] mm/mprotect: Fix dax puds Peter Xu
  2024-07-24 15:15 ` Peter Xu
  9 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-07-15 19:21 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, peterx,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar

This is only relevant to the two archs that support PUD dax, i.e., x86_64
and ppc64.  PUD THPs do not yet exist elsewhere, and hugetlb PUDs do not
count in this case.

DAX has had PUD mappings for years, but the change protection path never
worked.  When the path is triggered in any form (a simple test program
would be: call mprotect() on a 1G dev_dax mapping), the kernel reports
"bad pud".  This patch should fix that.

The new change_huge_pud() tries to keep everything simple.  For example,
it doesn't optimize the write bit, as that would need even more PUD
helpers.  It's not too bad anyway to take one more write fault in the
worst case, since it's once per 1G range; it would be a bigger deal if it
were per PAGE_SIZE.  Neither does it support the userfault-wp bits, as
there are no such PUD mappings supported; file mappings always need a
split there.

The same goes for TLB shootdown: the pmd path (which was x86 only) has the
trick of using the _ad() version of pmdp_invalidate*(), which can avoid
one redundant TLB flush, but let's also leave that for later.  Again, the
larger the mapping, the smaller such an effect.

Another thing worth mentioning is that this path needs to be careful about
handling the "retry" event for change_huge_pud() (where it can return 0):
it isn't like change_huge_pmd(), as the pmd version is safe with all
conditions handled in change_pte_range() later, thanks to Hugh's new
pte_offset_map_lock().  In short, change_pte_range() is simply smarter
than change_pmd_range() after the shmem thp collapse rework.  For that
reason, change_pud_range() needs a proper retry if it races with something
else and a huge PUD changed from under us.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: x86@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h | 24 +++++++++++++++++++
 mm/huge_memory.c        | 52 +++++++++++++++++++++++++++++++++++++++++
 mm/mprotect.c           | 34 ++++++++++++++++++++++-----
 3 files changed, 104 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index cff002be83eb..6e742680590a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -336,6 +336,17 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long address);
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
+		    unsigned long cp_flags);
+#else
+static inline int
+change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		pud_t *pudp, unsigned long addr, pgprot_t newprot,
+		unsigned long cp_flags) { return 0; }
+#endif
+
 #define split_huge_pud(__vma, __pud, __address)				\
 	do {								\
 		pud_t *____pud = (__pud);				\
@@ -579,6 +590,19 @@ static inline int next_order(unsigned long *orders, int prev)
 {
 	return 0;
 }
+
+static inline void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
+				    unsigned long address)
+{
+}
+
+static inline int change_huge_pud(struct mmu_gather *tlb,
+				  struct vm_area_struct *vma, pud_t *pudp,
+				  unsigned long addr, pgprot_t newprot,
+				  unsigned long cp_flags)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline int split_folio_to_list_to_order(struct folio *folio,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c10247bef08a..9a00c5955c0c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2112,6 +2112,53 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	return ret;
 }
 
+/*
+ * Returns:
+ *
+ * - 0: if pud leaf changed from under us
+ * - 1: if pud can be skipped
+ * - HPAGE_PUD_NR: if pud was successfully processed
+ */
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
+		    unsigned long cp_flags)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pud_t oldpud, entry;
+	spinlock_t *ptl;
+
+	tlb_change_page_size(tlb, HPAGE_PUD_SIZE);
+
+	/* NUMA balancing doesn't apply to dax */
+	if (cp_flags & MM_CP_PROT_NUMA)
+		return 1;
+
+	/*
+	 * Huge entries on userfault-wp only works with anonymous, while we
+	 * don't have anonymous PUDs yet.
+	 */
+	if (WARN_ON_ONCE(cp_flags & MM_CP_UFFD_WP_ALL))
+		return 1;
+
+	ptl = __pud_trans_huge_lock(pudp, vma);
+	if (!ptl)
+		return 0;
+
+	/*
+	 * Can't clear PUD or it can race with concurrent zapping.  See
+	 * change_huge_pmd().
+	 */
+	oldpud = pudp_invalidate(vma, addr, pudp);
+	entry = pud_modify(oldpud, newprot);
+	set_pud_at(mm, addr, pudp, entry);
+	tlb_flush_pud_range(tlb, addr, HPAGE_PUD_SIZE);
+
+	spin_unlock(ptl);
+	return HPAGE_PUD_NR;
+}
+#endif
+
 #ifdef CONFIG_USERFAULTFD
 /*
  * The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by
@@ -2342,6 +2389,11 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
 }
+#else
+void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
+		unsigned long address)
+{
+}
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2a81060b603d..694f13b83864 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -425,31 +425,53 @@ static inline long change_pud_range(struct mmu_gather *tlb,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	struct mmu_notifier_range range;
-	pud_t *pud;
+	pud_t *pudp, pud;
 	unsigned long next;
 	long pages = 0, ret;
 
 	range.start = 0;
 
-	pud = pud_offset(p4d, addr);
+	pudp = pud_offset(p4d, addr);
 	do {
+again:
 		next = pud_addr_end(addr, end);
-		ret = change_prepare(vma, pud, pmd, addr, cp_flags);
+		ret = change_prepare(vma, pudp, pmd, addr, cp_flags);
 		if (ret) {
 			pages = ret;
 			break;
 		}
-		if (pud_none_or_clear_bad(pud))
+
+		pud = READ_ONCE(*pudp);
+		if (pud_none(pud))
 			continue;
+
 		if (!range.start) {
 			mmu_notifier_range_init(&range,
 						MMU_NOTIFY_PROTECTION_VMA, 0,
 						vma->vm_mm, addr, end);
 			mmu_notifier_invalidate_range_start(&range);
 		}
-		pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
+
+		if (pud_leaf(pud)) {
+			if ((next - addr != PUD_SIZE) ||
+			    pgtable_split_needed(vma, cp_flags)) {
+				__split_huge_pud(vma, pudp, addr);
+				goto again;
+			} else {
+				ret = change_huge_pud(tlb, vma, pudp,
+						      addr, newprot, cp_flags);
+				if (ret == 0)
+					goto again;
+				/* huge pud was handled */
+				if (ret == HPAGE_PUD_NR)
+					pages += HPAGE_PUD_NR;
+				continue;
+			}
+		}
+
+		pages += change_pmd_range(tlb, vma, pudp, addr, next, newprot,
 					  cp_flags);
-	} while (pud++, addr = next, addr != end);
+	} while (pudp++, addr = next, addr != end);
 
 	if (range.start)
 		mmu_notifier_invalidate_range_end(&range);
-- 
2.45.0




* Re: [PATCH v3 0/8] mm/mprotect: Fix dax puds
  2024-07-15 19:21 [PATCH v3 0/8] mm/mprotect: Fix dax puds Peter Xu
                   ` (7 preceding siblings ...)
  2024-07-15 19:21 ` [PATCH v3 8/8] mm/mprotect: fix dax pud handlings Peter Xu
@ 2024-07-15 20:00 ` Peter Xu
  2024-07-24 15:15 ` Peter Xu
  9 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2024-07-15 20:00 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, Oscar Salvador,
	Mel Gorman, Andrew Morton, Borislav Petkov, Christophe Leroy,
	Huang Ying, Kirill A . Shutemov, Aneesh Kumar K . V, Dan Williams,
	Thomas Gleixner, Hugh Dickins, x86, Nicholas Piggin,
	Vlastimil Babka, Ingo Molnar

On Mon, Jul 15, 2024 at 03:21:34PM -0400, Peter Xu wrote:
> [Based on mm-unstable, commit 31334cf98dbd, July 2nd]

I forgot to update this in the cover letter; it's actually based on the
latest mm-unstable, which is 79ae458094ff as of today (July 15th).

-- 
Peter Xu




* Re: [PATCH v3 0/8] mm/mprotect: Fix dax puds
  2024-07-15 19:21 [PATCH v3 0/8] mm/mprotect: Fix dax puds Peter Xu
                   ` (8 preceding siblings ...)
  2024-07-15 20:00 ` [PATCH v3 0/8] mm/mprotect: Fix dax puds Peter Xu
@ 2024-07-24 15:15 ` Peter Xu
  9 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2024-07-24 15:15 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, Oscar Salvador,
	Mel Gorman, Andrew Morton, Borislav Petkov, Christophe Leroy,
	Huang Ying, Kirill A . Shutemov, Aneesh Kumar K . V, Dan Williams,
	Thomas Gleixner, Hugh Dickins, x86, Nicholas Piggin,
	Vlastimil Babka, Ingo Molnar

On Mon, Jul 15, 2024 at 03:21:34PM -0400, Peter Xu wrote:
> [Based on mm-unstable, commit 31334cf98dbd, July 2nd]
> 
> v3:
> - Fix a build issue on i386 PAE config
> - Moved one line from patch 8 to patch 3
> 
> v1: https://lore.kernel.org/r/20240621142504.1940209-1-peterx@redhat.com
> v2: https://lore.kernel.org/r/20240703212918.2417843-1-peterx@redhat.com
> 
> Dax supports pud pages for a while, but mprotect on puds was missing since
> the start.  This series tries to fix that by providing pud handling in
> mprotect().  The goal is to add more types of pud mappings like hugetlb or
> pfnmaps.  This series paves way for it by fixing known pud entries.
> 
> Considering nobody reported this until when I looked at those other types
> of pud mappings, I am thinking maybe it doesn't need to be a fix for stable
> and this may not need to be backported.  I would guess whoever cares about
> mprotect() won't care 1G dax puds yet, vice versa.  I hope fixing that in
> new kernels would be fine, but I'm open to suggestions.
> 
> There're a few small things changed to teach mprotect work on PUDs. E.g. it
> will need to start with dropping NUMA_HUGE_PTE_UPDATES which may stop
> making sense when there can be more than one type of huge pte.  OTOH, we'll
> also need to push the mmu notifiers from pmd to pud layers, which might
> need some attention but so far I think it's safe.  For such details, please
> refer to each patch's commit message.
> 
> The mprotect() pud process should be straightforward, as I kept it as
> simple as possible.  There's no NUMA handled as dax simply doesn't support
> that.  There's also no userfault involvements as file memory (even if work
> with userfault-wp async mode) will need to split a pud, so pud entry
> doesn't need to yet know userfault's existance (but hugetlb entries will;
> that's also for later).
> 
> Tests
> =====
> 
> What I did test:
> 
> - cross-build tests that I normally cover [1]
> 
> - smoke tested on x86_64 the simplest program [2] on dev_dax 1G PUD
>   mprotect() using QEMU's nvdimm emulations [3] and ndctl to create
>   namespaces with proper alignments, which used to throw "bad pud" but now
>   it'll run through all fine.  I checked sigbus happens if with illegal
>   access on protected puds.
> 
> - vmtests.
> 
> What I didn't test:
> 
> - fsdax: I wanted to also give it a shot, but only until then I noticed it
>   doesn't seem to be supported (according to dax_iomap_fault(), which will
>   always fallback on PUD_ORDER).  I did remember it was supported before, I
>   could miss something important there.. please shoot if so.
> 
> - userfault wp-async: I also wanted to test userfault-wp async be able to
>   split huge puds (here it's simply a clear_pud.. though), but it won't
>   work for devdax anyway due to not allowed to do smaller than 1G faults in
>   this case. So skip too.
> 
> - Power, as no hardware on hand.

Ping - any review comments or even tests would be greatly appreciated.

I'm not sure whether this matters to anyone yet.  I hope it still makes
sense for DAX even if it's an extreme corner case...

Just to mention the follow-up users of this path:

  - huge pfnmap 1G may use this: when VM_PFNMAP can be mapped with 1G too,
    we would hit a similar "bad pud" there.

  - the hugetlb rework will use this, when we want this path to process 1G
    hugetlb pages too.

The 1st user is not a must in my initial plan, as the VFIO + VM use case
doesn't use mprotect(), so we could keep (1) broken together with DAX 1G
here.  But for the long term we should still fix this, IMHO.

Thanks,

-- 
Peter Xu




* Re: [PATCH v3 8/8] mm/mprotect: fix dax pud handlings
  2024-07-15 19:21 ` [PATCH v3 8/8] mm/mprotect: fix dax pud handlings Peter Xu
@ 2024-07-25 18:29   ` James Houghton
  2024-07-25 22:41     ` Peter Xu
  0 siblings, 1 reply; 25+ messages in thread
From: James Houghton @ 2024-07-25 18:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Dave Jiang, Rik van Riel, Dave Hansen,
	Michael Ellerman, linuxppc-dev, Matthew Wilcox, Rick P Edgecombe,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar

On Mon, Jul 15, 2024 at 12:22 PM Peter Xu <peterx@redhat.com> wrote:
>
> This is only relevant to the two archs that support PUD dax, aka, x86_64
> and ppc64.  PUD THPs do not yet exist elsewhere, and hugetlb PUDs do not
> count in this case.
>
> DAX have had PUD mappings for years, but change protection path never
> worked.  When the path is triggered in any form (a simple test program
> would be: call mprotect() on a 1G dev_dax mapping), the kernel will report
> "bad pud".  This patch should fix that.
>
> The new change_huge_pud() tries to keep everything simple.  For example, it
> doesn't optimize write bit as that will need even more PUD helpers.  It's
> not too bad anyway to have one more write fault in the worst case once for
> 1G range; may be a bigger thing for each PAGE_SIZE, though.  Neither does
> it support userfault-wp bits, as there isn't such PUD mappings that is
> supported; file mappings always need a split there.
>
> The same to TLB shootdown: the pmd path (which was for x86 only) has the
> trick of using _ad() version of pmdp_invalidate*() which can avoid one
> redundant TLB, but let's also leave that for later.  Again, the larger the
> mapping, the smaller of such effect.
>
> Another thing worth mention is this path needs to be careful on handling
> "retry" event for change_huge_pud() (where it can return 0): it isn't like
> change_huge_pmd(), as the pmd version is safe with all conditions handled
> in change_pte_range() later, thanks to Hugh's new pte_offset_map_lock().
> In short, change_pte_range() is simply smarter than change_pmd_range() now
> after the shmem thp collapse rework.  For that reason, change_pud_range()
> will need proper retry if it races with something else when a huge PUD
> changed from under us.
>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: x86@kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
> Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage")
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/huge_mm.h | 24 +++++++++++++++++++
>  mm/huge_memory.c        | 52 +++++++++++++++++++++++++++++++++++++++++
>  mm/mprotect.c           | 34 ++++++++++++++++++++++-----
>  3 files changed, 104 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index cff002be83eb..6e742680590a 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -336,6 +336,17 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
>  void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>                 unsigned long address);
>
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +                   pud_t *pudp, unsigned long addr, pgprot_t newprot,
> +                   unsigned long cp_flags);
> +#else
> +static inline int
> +change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +               pud_t *pudp, unsigned long addr, pgprot_t newprot,
> +               unsigned long cp_flags) { return 0; }
> +#endif
> +
>  #define split_huge_pud(__vma, __pud, __address)                                \
>         do {                                                            \
>                 pud_t *____pud = (__pud);                               \
> @@ -579,6 +590,19 @@ static inline int next_order(unsigned long *orders, int prev)
>  {
>         return 0;
>  }
> +
> +static inline void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> +                                   unsigned long address)
> +{
> +}
> +
> +static inline int change_huge_pud(struct mmu_gather *tlb,
> +                                 struct vm_area_struct *vma, pud_t *pudp,
> +                                 unsigned long addr, pgprot_t newprot,
> +                                 unsigned long cp_flags)
> +{
> +       return 0;
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
>  static inline int split_folio_to_list_to_order(struct folio *folio,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c10247bef08a..9a00c5955c0c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2112,6 +2112,53 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>         return ret;
>  }
>
> +/*
> + * Returns:
> + *
> + * - 0: if pud leaf changed from under us
> + * - 1: if pud can be skipped
> + * - HPAGE_PUD_NR: if pud was successfully processed
> + */
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +                   pud_t *pudp, unsigned long addr, pgprot_t newprot,
> +                   unsigned long cp_flags)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       pud_t oldpud, entry;
> +       spinlock_t *ptl;
> +
> +       tlb_change_page_size(tlb, HPAGE_PUD_SIZE);
> +
> +       /* NUMA balancing doesn't apply to dax */
> +       if (cp_flags & MM_CP_PROT_NUMA)
> +               return 1;
> +
> +       /*
> +        * Huge entries on userfault-wp only works with anonymous, while we
> +        * don't have anonymous PUDs yet.
> +        */
> +       if (WARN_ON_ONCE(cp_flags & MM_CP_UFFD_WP_ALL))
> +               return 1;
> +
> +       ptl = __pud_trans_huge_lock(pudp, vma);
> +       if (!ptl)
> +               return 0;
> +
> +       /*
> +        * Can't clear PUD or it can race with concurrent zapping.  See
> +        * change_huge_pmd().
> +        */
> +       oldpud = pudp_invalidate(vma, addr, pudp);
> +       entry = pud_modify(oldpud, newprot);
> +       set_pud_at(mm, addr, pudp, entry);
> +       tlb_flush_pud_range(tlb, addr, HPAGE_PUD_SIZE);
> +
> +       spin_unlock(ptl);
> +       return HPAGE_PUD_NR;
> +}
> +#endif
> +
>  #ifdef CONFIG_USERFAULTFD
>  /*
>   * The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by
> @@ -2342,6 +2389,11 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>         spin_unlock(ptl);
>         mmu_notifier_invalidate_range_end(&range);
>  }
> +#else
> +void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> +               unsigned long address)
> +{
> +}
>  #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>
>  static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 2a81060b603d..694f13b83864 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -425,31 +425,53 @@ static inline long change_pud_range(struct mmu_gather *tlb,
>                 unsigned long end, pgprot_t newprot, unsigned long cp_flags)
>  {
>         struct mmu_notifier_range range;
> -       pud_t *pud;
> +       pud_t *pudp, pud;
>         unsigned long next;
>         long pages = 0, ret;
>
>         range.start = 0;
>
> -       pud = pud_offset(p4d, addr);
> +       pudp = pud_offset(p4d, addr);
>         do {
> +again:
>                 next = pud_addr_end(addr, end);
> -               ret = change_prepare(vma, pud, pmd, addr, cp_flags);
> +               ret = change_prepare(vma, pudp, pmd, addr, cp_flags);
>                 if (ret) {
>                         pages = ret;
>                         break;
>                 }
> -               if (pud_none_or_clear_bad(pud))
> +
> +               pud = READ_ONCE(*pudp);
> +               if (pud_none(pud))
>                         continue;
> +
>                 if (!range.start) {
>                         mmu_notifier_range_init(&range,
>                                                 MMU_NOTIFY_PROTECTION_VMA, 0,
>                                                 vma->vm_mm, addr, end);
>                         mmu_notifier_invalidate_range_start(&range);
>                 }
> -               pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
> +
> +               if (pud_leaf(pud)) {
> +                       if ((next - addr != PUD_SIZE) ||
> +                           pgtable_split_needed(vma, cp_flags)) {
> +                               __split_huge_pud(vma, pudp, addr);
> +                               goto again;

IIUC, most of the time, we're just going to end up clearing the PUD in
this case. __split_huge_pud() will just clear it, then we goto again
and `continue` to the next pudp. Is that ok?

(I think it's ok as long as: you never map an anonymous page with a
PUD, and that uffd-wp is not usable with non-hugetlb PUD mappings of
user memory (which I think is only DAX?). So it seems ok today...?)

Also, does the comment in pgtable_split_needed() need updating?

Somewhat related question: change_huge_pmd() is very careful not to
clear the PMD before writing the new value. Yet change_pmd_range(),
when it calls into __split_huge_pmd(), will totally clear the PMD and
then populate the PTEs underneath (in some cases at least), seemingly
reintroducing the MADV_DONTNEED concern. But your PUD version, because
it never re-populates the PUD (or PMDs/PTEs underneath) does not have
this issue. WDYT?

Thanks for this series!

> +                       } else {
> +                               ret = change_huge_pud(tlb, vma, pudp,
> +                                                     addr, newprot, cp_flags);
> +                               if (ret == 0)
> +                                       goto again;
> +                               /* huge pud was handled */
> +                               if (ret == HPAGE_PUD_NR)
> +                                       pages += HPAGE_PUD_NR;
> +                               continue;
> +                       }
> +               }
> +
> +               pages += change_pmd_range(tlb, vma, pudp, addr, next, newprot,
>                                           cp_flags);
> -       } while (pud++, addr = next, addr != end);
> +       } while (pudp++, addr = next, addr != end);
>
>         if (range.start)
>                 mmu_notifier_invalidate_range_end(&range);
> --
> 2.45.0
>
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 8/8] mm/mprotect: fix dax pud handlings
  2024-07-25 18:29   ` James Houghton
@ 2024-07-25 22:41     ` Peter Xu
  2024-07-26  0:23       ` James Houghton
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-07-25 22:41 UTC (permalink / raw)
  To: James Houghton
  Cc: linux-mm, linux-kernel, Dave Jiang, Rik van Riel, Dave Hansen,
	Michael Ellerman, linuxppc-dev, Matthew Wilcox, Rick P Edgecombe,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar

On Thu, Jul 25, 2024 at 11:29:49AM -0700, James Houghton wrote:
> > -               pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
> > +
> > +               if (pud_leaf(pud)) {
> > +                       if ((next - addr != PUD_SIZE) ||
> > +                           pgtable_split_needed(vma, cp_flags)) {
> > +                               __split_huge_pud(vma, pudp, addr);
> > +                               goto again;
> 
> IIUC, most of the time, we're just going to end up clearing the PUD in
> this case. __split_huge_pud() will just clear it, then we goto again
> and `continue` to the next pudp. Is that ok?
> 
> (I think it's ok as long as: you never map an anonymous page with a
> PUD,

I think this is true.

> and that uffd-wp is not usable with non-hugetlb PUD mappings of
> user memory (which I think is only DAX?).

Uffd-wp has the async mode that can even work with dax puds.. even though I
don't think anyone should be using it.  Just like I'm more sure that nobody
is using mprotect() too with dax pud, and it further justifies why nobody
cared this much..

What uffd-wp would do in this case is make pgtable_split_needed() return
true on this PUD, so the PUD gets wiped out, we goto again, then
change_prepare() will populate this pud with a pgtable page.  Then it goes
downwards, installs a PMD pgtable, and probably starts installing pte
markers ultimately if it's a wr-protect operation.
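
To make that sequence concrete, here's a toy userspace model of the control
flow (not kernel code -- every name, state and message below is made up
purely for illustration):

#include <stdio.h>

/* Toy model of the leaf-pud handling discussed above; not kernel code. */
enum pud_state { PUD_NONE, PUD_LEAF, PUD_TABLE };

static enum pud_state pud = PUD_LEAF;   /* starts as a huge (leaf) pud */
static int need_markers = 1;            /* uffd-wp wr-protect wants pte markers */

static void walk_one_pud(void)
{
again:
        /* "change_prepare()": populate a none pud when markers are needed */
        if (pud == PUD_NONE && need_markers) {
                pud = PUD_TABLE;
                printf("pgtable page installed under the pud\n");
        }

        if (pud == PUD_NONE)
                return;                 /* nothing to do */

        if (pud == PUD_LEAF) {
                if (need_markers) {
                        /* "__split_huge_pud()" on dax: simply clear it */
                        pud = PUD_NONE;
                        printf("leaf pud wiped out, retrying\n");
                        goto again;
                }
                printf("changing protection on the huge pud in place\n");
                return;
        }

        /* PUD_TABLE: descend to pmd/pte levels; pte markers end up there */
        printf("descending to pmd/pte level\n");
}

int main(void)
{
        walk_one_pud();
        return 0;
}

Running it prints the wipe / retry / populate / descend steps in that order.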

> So it seems ok today...?)

Yes I think it's ok so far, unless you think it's not. :)

> 
> Also, does the comment in pgtable_split_needed() need updating?

/*
 * Return true if we want to split THPs into PTE mappings in change
 * protection procedure, false otherwise.
 */

It looks ok for now to me?  THP can represent PUD in dax, and we indeed
want to break THPs (no matter PUD/PMD) into PTE mappings in the end.

> 
> Somewhat related question: change_huge_pmd() is very careful not to
> clear the PMD before writing the new value. Yet change_pmd_range(),
> when it calls into __split_huge_pmd(), will totally clear the PMD and
> then populate the PTEs underneath (in some cases at least), seemingly
> reintroducing the MADV_DONTNEED concern. But your PUD version, because
> it never re-populates the PUD (or PMDs/PTEs underneath) does not have
> this issue. WDYT?

Could you elaborate more on the DONTNEED issue you're mentioning here?

> 
> Thanks for this series!

Thanks for reviewing it, James.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 8/8] mm/mprotect: fix dax pud handlings
  2024-07-25 22:41     ` Peter Xu
@ 2024-07-26  0:23       ` James Houghton
  2024-07-26 11:56         ` Peter Xu
  0 siblings, 1 reply; 25+ messages in thread
From: James Houghton @ 2024-07-26  0:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Dave Jiang, Rik van Riel, Dave Hansen,
	Michael Ellerman, linuxppc-dev, Matthew Wilcox, Rick P Edgecombe,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar

On Thu, Jul 25, 2024 at 3:41 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Jul 25, 2024 at 11:29:49AM -0700, James Houghton wrote:
> > > -               pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
> > > +
> > > +               if (pud_leaf(pud)) {
> > > +                       if ((next - addr != PUD_SIZE) ||
> > > +                           pgtable_split_needed(vma, cp_flags)) {
> > > +                               __split_huge_pud(vma, pudp, addr);
> > > +                               goto again;
> >
> > IIUC, most of the time, we're just going to end up clearing the PUD in
> > this case. __split_huge_pud() will just clear it, then we goto again
> > and `continue` to the next pudp. Is that ok?
> >
> > (I think it's ok as long as: you never map an anonymous page with a
> > PUD,
>
> I think this is true.
>
> > and that uffd-wp is not usable with non-hugetlb PUD mappings of
> > user memory (which I think is only DAX?).
>
> Uffd-wp has the async mode that can even work with dax puds.. even though I
> don't think anyone should be using it.  Just like I'm more sure that nobody
> is using mprotect() too with dax pud, and it further justifies why nobody
> cared this much..
>
> What uffd-wp would do in this case is make pgtable_split_needed() return
> true on this PUD, so the PUD gets wiped out, we goto again, then
> change_prepare() will populate this pud with a pgtable page.  Then it goes
> downwards, installs a PMD pgtable, and probably starts installing pte
> markers ultimately if it's a wr-protect operation.

Ah, I didn't understand this patch correctly. You're right, you'll
install PMDs underneath -- I thought we'd just find pud_none() and
bail out. So this all makes sense, thanks!

>
> > So it seems ok today...?)
>
> Yes I think it's ok so far, unless you think it's not. :)
>
> >
> > Also, does the comment in pgtable_split_needed() need updating?
>
> /*
>  * Return true if we want to split THPs into PTE mappings in change
>  * protection procedure, false otherwise.
>  */
>
> It looks ok for now to me?  THP can represent PUD in dax, and we indeed
> want to break THPs (no matter PUD/PMD) into PTE mappings in the end.

Oh, heh I was thinking of the other comment:

/*
* pte markers only resides in pte level, if we need pte markers,
* we need to split.  We cannot wr-protect shmem thp because file
* thp is handled differently when split by erasing the pmd so far.
*/

I guess this is fine too, it just kind of reads as if this function is
only called for PMDs. *shrug*

>
> >
> > Somewhat related question: change_huge_pmd() is very careful not to
> > clear the PMD before writing the new value. Yet change_pmd_range(),
> > when it calls into __split_huge_pmd(), will totally clear the PMD and
> > then populate the PTEs underneath (in some cases at least), seemingly
> > reintroducing the MADV_DONTNEED concern. But your PUD version, because
> > it never re-populates the PUD (or PMDs/PTEs underneath) does not have
> > this issue. WDYT?

I guess I'm wrong about this -- your PUD version is the same as the
PMD version in this respect: both clear the PUD/PMD and then install
lower page table entries.

>
> Could you elaborate more on the DONTNEED issue you're mentioning here?

In change_huge_pmd(), there is a comment about not clearing the pmd
before setting the new value. I guess the issue is: if a user is
MADV_DONTNEEDing some memory and happens to see a cleared PMD, it will
just move on. But in this case, if change_huge_pmd() temporarily
cleared a PMD, then MADV_DONTNEED saw it and moved on, and then
change_huge_pmd() installed a new PMD, the MADV_DONTNEED has basically
skipped over the PMD, probably not what the user wanted. It seems like
we have the same issue when a PMD is cleared when we're splitting it.
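
The interleaving can also be written down as a tiny single-threaded
simulation (a userspace analogy only -- the variable below just stands in
for the PMD and zap() for MADV_DONTNEED):

#include <stdio.h>

/* Analogy of the lost-zap ordering described above; not kernel code. */
static unsigned long entry = 0x1000;    /* the "mapped" pmd */

static void zap(void)                   /* plays the MADV_DONTNEED side */
{
        if (!entry)                     /* sees a cleared entry... */
                return;                 /* ...and moves on */
        entry = 0;
}

int main(void)
{
        unsigned long old = entry;

        entry = 0;          /* protection change temporarily clears it */
        zap();              /* the concurrent MADV_DONTNEED lands here */
        entry = old | 0x4;  /* new (protected) value installed afterwards */

        /* The zap was lost: the entry is still "mapped".  Keeping the
         * entry non-none during the update (what pmdp_invalidate() and
         * pudp_invalidate() provide) closes this window. */
        printf("entry after the race: %#lx\n", entry);
        return 0;
}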

But yeah, given that your PUD version is apparently no different from
the PMD version in this respect, maybe this question isn't very
interesting. :)

>
> >
> > Thanks for this series!
>
> Thanks for reviewing it, James.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 8/8] mm/mprotect: fix dax pud handlings
  2024-07-26  0:23       ` James Houghton
@ 2024-07-26 11:56         ` Peter Xu
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2024-07-26 11:56 UTC (permalink / raw)
  To: James Houghton
  Cc: linux-mm, linux-kernel, Dave Jiang, Rik van Riel, Dave Hansen,
	Michael Ellerman, linuxppc-dev, Matthew Wilcox, Rick P Edgecombe,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar

On Thu, Jul 25, 2024 at 05:23:48PM -0700, James Houghton wrote:
> On Thu, Jul 25, 2024 at 3:41 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Jul 25, 2024 at 11:29:49AM -0700, James Houghton wrote:
> > > > -               pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
> > > > +
> > > > +               if (pud_leaf(pud)) {
> > > > +                       if ((next - addr != PUD_SIZE) ||
> > > > +                           pgtable_split_needed(vma, cp_flags)) {
> > > > +                               __split_huge_pud(vma, pudp, addr);
> > > > +                               goto again;
> > >
> > > IIUC, most of the time, we're just going to end up clearing the PUD in
> > > this case. __split_huge_pud() will just clear it, then we goto again
> > > and `continue` to the next pudp. Is that ok?
> > >
> > > (I think it's ok as long as: you never map an anonymous page with a
> > > PUD,
> >
> > I think this is true.
> >
> > > and that uffd-wp is not usable with non-hugetlb PUD mappings of
> > > user memory (which I think is only DAX?).
> >
> > Uffd-wp has the async mode that can even work with dax puds.. even though I
> > don't think anyone should be using it.  Just like I'm more sure that nobody
> > is using mprotect() too with dax pud, and it further justifies why nobody
> > cared this much..
> >
> > What uffd-wp would do in this case is make pgtable_split_needed() return
> > true on this PUD, so the PUD gets wiped out, we goto again, then
> > change_prepare() will populate this pud with a pgtable page.  Then it goes
> > downwards, installs a PMD pgtable, and probably starts installing pte
> > markers ultimately if it's a wr-protect operation.
> 
> Ah, I didn't understand this patch correctly. You're right, you'll
> install PMDs underneath -- I thought we'd just find pud_none() and
> bail out. So this all makes sense, thanks!
> 
> >
> > > So it seems ok today...?)
> >
> > Yes I think it's ok so far, unless you think it's not. :)
> >
> > >
> > > Also, does the comment in pgtable_split_needed() need updating?
> >
> > /*
> >  * Return true if we want to split THPs into PTE mappings in change
> >  * protection procedure, false otherwise.
> >  */
> >
> > It looks ok for now to me?  THP can represent PUD in dax, and we indeed
> > want to break THPs (no matter PUD/PMD) into PTE mappings in the end.
> 
> Oh, heh I was thinking of the other comment:
> 
> /*
> * pte markers only resides in pte level, if we need pte markers,
> * we need to split.  We cannot wr-protect shmem thp because file
> * thp is handled differently when split by erasing the pmd so far.
> */
> 
> I guess this is fine too, it just kind of reads as if this function is
> only called for PMDs. *shrug*

Ah ok, yeah it looks mostly fine here too.  Let's make this easy by keeping
it as-is, but if there'll be a new version I can touch it up if it helps
readability.

> 
> >
> > >
> > > Somewhat related question: change_huge_pmd() is very careful not to
> > > clear the PMD before writing the new value. Yet change_pmd_range(),
> > > when it calls into __split_huge_pmd(), will totally clear the PMD and
> > > then populate the PTEs underneath (in some cases at least), seemingly
> > > reintroducing the MADV_DONTNEED concern. But your PUD version, because
> > > it never re-populates the PUD (or PMDs/PTEs underneath) does not have
> > > this issue. WDYT?
> 
> I guess I'm wrong about this -- your PUD version is the same as the
> PMD version in this respect: both clear the PUD/PMD and then install
> lower page table entries.
> 
> >
> > Could you elaborate more on the DONTNEED issue you're mentioning here?
> 
> In change_huge_pmd(), there is a comment about not clearing the pmd
> before setting the new value. I guess the issue is: if a user is
> MADV_DONTNEEDing some memory and happens to see a cleared PMD, it will
> just move on. But in this case, if change_huge_pmd() temporarily
> cleared a PMD, then MADV_DONTNEED saw it and moved on, and then
> change_huge_pmd() installed a new PMD, the MADV_DONTNEED has basically
> skipped over the PMD, probably not what the user wanted. It seems like
> we have the same issue when a PMD is cleared when we're splitting it.
> 
> But yeah, given that your PUD version is apparently no different from
> the PMD version in this respect, maybe this question isn't very
> interesting. :)

Ah that one, so yeah that's why I introduced pudp_invalidate() in this
series to make sure that issue isn't there.  Then I assume we're good.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/8] mm/dax: Dump start address in fault handler
  2024-07-15 19:21 ` [PATCH v3 1/8] mm/dax: Dump start address in fault handler Peter Xu
@ 2024-07-31 12:04   ` David Hildenbrand
  2024-08-02 22:43     ` Peter Xu
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2024-07-31 12:04 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, Oscar Salvador,
	Mel Gorman, Andrew Morton, Borislav Petkov, Christophe Leroy,
	Huang Ying, Kirill A . Shutemov, Aneesh Kumar K . V, Dan Williams,
	Thomas Gleixner, Hugh Dickins, x86, Nicholas Piggin,
	Vlastimil Babka, Ingo Molnar

On 15.07.24 21:21, Peter Xu wrote:
> Currently the dax fault handler dumps the vma range when dynamic debugging
> enabled.  That's mostly not useful.  Dump the (aligned) address instead
> with the order info.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   drivers/dax/device.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index eb61598247a9..714174844ca5 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -235,9 +235,9 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, unsigned int order)
>   	int id;
>   	struct dev_dax *dev_dax = filp->private_data;
>   
> -	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) order:%d\n", current->comm,
> -			(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
> -			vmf->vma->vm_start, vmf->vma->vm_end, order);
> +	dev_dbg(&dev_dax->dev, "%s: op=%s addr=%#lx order=%d\n", current->comm,
> +		(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
> +		vmf->address & ~((1UL << (order + PAGE_SHIFT)) - 1), order);
>   
>   	id = dax_read_lock();
>   	if (order == 0)

Agreed, the address of the fault is better. Just wondering, would the 
unmasked address be even better? Using the order we can figure out the 
to-be-aligned address.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/8] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES
  2024-07-15 19:21 ` [PATCH v3 2/8] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES Peter Xu
@ 2024-07-31 12:18   ` David Hildenbrand
  2024-08-04 15:06     ` Peter Xu
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2024-07-31 12:18 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, Oscar Salvador,
	Mel Gorman, Andrew Morton, Borislav Petkov, Christophe Leroy,
	Huang Ying, Kirill A . Shutemov, Aneesh Kumar K . V, Dan Williams,
	Thomas Gleixner, Hugh Dickins, x86, Nicholas Piggin,
	Vlastimil Babka, Ingo Molnar, Alex Thorlton

On 15.07.24 21:21, Peter Xu wrote:
> In 2013, commit 72403b4a0fbd ("mm: numa: return the number of base pages
> altered by protection changes") introduced "numa_huge_pte_updates" vmstat
> entry, trying to capture how many huge ptes (in reality, PMD thps at that
> time) are marked by NUMA balancing.
> 
> This patch proposes to remove it for some reasons.
> 
> Firstly, the name is misleading. We can have more than one way to have a
> "huge pte" at least nowadays, and that's also the major goal of this patch,
> where it paves way for PUD handling in change protection code paths.
> 
> PUDs are coming not only for dax (which has already came and yet broken..),
> but also for pfnmaps and hugetlb pages.  The name will simply stop making
> sense when PUD will start to be involved in mprotect() world.
> 
> It'll also make it not reasonable either if we boost the counter for both
> pmd/puds.  In short, current accounting won't be right when PUD comes, so
> the scheme was only suitable at that point in time where PUD wasn't even
> possible.
> 
> Secondly, the accounting was simply not right from the start as long as it
> was also affected by other call sites besides NUMA.  mprotect() is one,
> while userfaultfd-wp also leverages change protection path to modify
> pgtables.  If it wants to do right it needs to check the caller but it
> never did; at least mprotect() should be there even in 2013.
> 
> It gives me the impression that nobody is seriously using this field, and
> it's also impossible to be serious.

It's weird and the implementation is ugly. The intention really was to 
only consider MM_CP_PROT_NUMA, but that apparently is not the case.

hugetlb/mprotect/... should have never been accounted.

[...]

>   
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 73d791d1caad..53656227f70d 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1313,7 +1313,6 @@ const char * const vmstat_text[] = {
>   
>   #ifdef CONFIG_NUMA_BALANCING
>   	"numa_pte_updates",
> -	"numa_huge_pte_updates",
>   	"numa_hint_faults",
>   	"numa_hint_faults_local",
>   	"numa_pages_migrated",

It's a user-visible update. I assume most tools should be prepared for 
this stat missing (just like handling !CONFIG_NUMA_BALANCING).
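
Tolerating the missing entry is cheap -- a minimal sketch of how a tool
could read the counter from /proc/vmstat and treat absence as a valid
outcome (the helper name below is made up):

#include <stdio.h>
#include <string.h>

/* Look a counter up by name in /proc/vmstat; -1 means "not present". */
static long long vmstat_read(const char *name)
{
        FILE *f = fopen("/proc/vmstat", "r");
        char key[64];
        long long val;

        if (!f)
                return -1;
        while (fscanf(f, "%63s %lld", key, &val) == 2) {
                if (!strcmp(key, name)) {
                        fclose(f);
                        return val;
                }
        }
        fclose(f);
        return -1;      /* counter absent: not a fatal condition */
}

int main(void)
{
        printf("numa_huge_pte_updates: %lld\n",
               vmstat_read("numa_huge_pte_updates"));
        return 0;
}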

Apparently it's documented [1][2] for some distros:

"The amount of transparent huge pages that were marked for NUMA hinting 
faults. In combination with numa_pte_updates the total address space 
that was marked can be calculated."

And now I realize that change_prot_numa() would account these PMD 
updates as well in numa_pte_updates and I am confused about the SUSE 
documentation: "In combination with numa_pte_updates" doesn't really 
apply, right?

At this point I don't know what's right or wrong.

If we'd want to fix it instead, the right thing to do would be doing the 
accounting only with MM_CP_PROT_NUMA. But then, numa_pte_updates is also 
wrongly updated I believe :(


[1] 
https://documentation.suse.com/de-de/sles/12-SP5/html/SLES-all/cha-tuning-numactl.html
[2] 
https://support.oracle.com/knowledge/Oracle%20Linux%20and%20Virtualization/2749259_1.html

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 5/8] mm/x86: Make pud_leaf() only cares about PSE bit
  2024-07-15 19:21 ` [PATCH v3 5/8] mm/x86: Make pud_leaf() only cares about PSE bit Peter Xu
@ 2024-07-31 12:22   ` David Hildenbrand
  0 siblings, 0 replies; 25+ messages in thread
From: David Hildenbrand @ 2024-07-31 12:22 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, Oscar Salvador,
	Mel Gorman, Andrew Morton, Borislav Petkov, Christophe Leroy,
	Huang Ying, Kirill A . Shutemov, Aneesh Kumar K . V, Dan Williams,
	Thomas Gleixner, Hugh Dickins, x86, Nicholas Piggin,
	Vlastimil Babka, Ingo Molnar

On 15.07.24 21:21, Peter Xu wrote:
> An entry should be reported as PUD leaf even if it's PROT_NONE, in which
> case PRESENT bit isn't there. I hit bad pud without this when testing dax
> 1G on zapping a PROT_NONE PUD.
> 

Subject s/cares/care/

> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: x86@kernel.org
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   arch/x86/include/asm/pgtable.h | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 65b8e5bb902c..25fc6d809572 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1073,8 +1073,7 @@ static inline pmd_t *pud_pgtable(pud_t pud)
>   #define pud_leaf pud_leaf
>   static inline bool pud_leaf(pud_t pud)
>   {
> -	return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
> -		(_PAGE_PSE | _PAGE_PRESENT);
> +	return pud_val(pud) & _PAGE_PSE;
>   }
>   
>   static inline int pud_bad(pud_t pud)

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 6/8] mm/x86: arch_check_zapped_pud()
  2024-07-15 19:21 ` [PATCH v3 6/8] mm/x86: arch_check_zapped_pud() Peter Xu
@ 2024-07-31 12:23   ` David Hildenbrand
  0 siblings, 0 replies; 25+ messages in thread
From: David Hildenbrand @ 2024-07-31 12:23 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Dave Jiang, Rik van Riel, Dave Hansen, Michael Ellerman,
	linuxppc-dev, Matthew Wilcox, Rick P Edgecombe, Oscar Salvador,
	Mel Gorman, Andrew Morton, Borislav Petkov, Christophe Leroy,
	Huang Ying, Kirill A . Shutemov, Aneesh Kumar K . V, Dan Williams,
	Thomas Gleixner, Hugh Dickins, x86, Nicholas Piggin,
	Vlastimil Babka, Ingo Molnar

On 15.07.24 21:21, Peter Xu wrote:
> Introduce arch_check_zapped_pud() to sanity check shadow stack on PUD zaps.
> It has the same logic of the PMD helper.
> 
> One thing to mention is, it might be a good idea to use page_table_check in
> the future for trapping wrong setups of shadow stack pgtable entries [1].
> That is left for the future as a separate effort.
> 
> [1] https://lore.kernel.org/all/59d518698f664e07c036a5098833d7b56b953305.camel@intel.com
> 
> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: x86@kernel.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/8] mm/dax: Dump start address in fault handler
  2024-07-31 12:04   ` David Hildenbrand
@ 2024-08-02 22:43     ` Peter Xu
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2024-08-02 22:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Dave Jiang, Rik van Riel, Dave Hansen,
	Michael Ellerman, linuxppc-dev, Matthew Wilcox, Rick P Edgecombe,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar

On Wed, Jul 31, 2024 at 02:04:38PM +0200, David Hildenbrand wrote:
> On 15.07.24 21:21, Peter Xu wrote:
> > Currently the dax fault handler dumps the vma range when dynamic debugging
> > enabled.  That's mostly not useful.  Dump the (aligned) address instead
> > with the order info.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   drivers/dax/device.c | 6 +++---
> >   1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> > index eb61598247a9..714174844ca5 100644
> > --- a/drivers/dax/device.c
> > +++ b/drivers/dax/device.c
> > @@ -235,9 +235,9 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, unsigned int order)
> >   	int id;
> >   	struct dev_dax *dev_dax = filp->private_data;
> > -	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) order:%d\n", current->comm,
> > -			(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
> > -			vmf->vma->vm_start, vmf->vma->vm_end, order);
> > +	dev_dbg(&dev_dax->dev, "%s: op=%s addr=%#lx order=%d\n", current->comm,
> > +		(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
> > +		vmf->address & ~((1UL << (order + PAGE_SHIFT)) - 1), order);
> >   	id = dax_read_lock();
> >   	if (order == 0)
> 
> Agreed, the address of the fault is better. Just wondering, would the
> unmasked address be even better? Using the order we can figure out the
> to-be-aligned address.

From my very limited dax experience of monitoring these faults and making
sure huge mappings popped up correctly.. it's so far easier when we see the
address properly aligned together with the order info.  But let me know if
you still prefer the unmasked address; I'm fine with making that
calculation simpler.
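
For reference, this is the masking in question as a standalone snippet
(PAGE_SHIFT of 12, the orders and the example address are assumptions just
for illustration):

#include <stdio.h>

#define PAGE_SHIFT 12UL         /* assuming 4K base pages */

static unsigned long align_down(unsigned long addr, unsigned int order)
{
        return addr & ~((1UL << (order + PAGE_SHIFT)) - 1);
}

int main(void)
{
        unsigned long addr = 0x7f1234567000UL;

        /* order 0: base page, order 9: 2M pmd, order 18: 1G pud (x86-64) */
        printf("%#lx\n", align_down(addr, 0));   /* 0x7f1234567000 */
        printf("%#lx\n", align_down(addr, 9));   /* 0x7f1234400000 */
        printf("%#lx\n", align_down(addr, 18));  /* 0x7f1200000000 */
        return 0;
}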

> 
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks for taking a look!

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/8] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES
  2024-07-31 12:18   ` David Hildenbrand
@ 2024-08-04 15:06     ` Peter Xu
  2024-08-06 13:02       ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-08-04 15:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Dave Jiang, Rik van Riel, Dave Hansen,
	Michael Ellerman, linuxppc-dev, Matthew Wilcox, Rick P Edgecombe,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar, Alex Thorlton

On Wed, Jul 31, 2024 at 02:18:26PM +0200, David Hildenbrand wrote:
> On 15.07.24 21:21, Peter Xu wrote:
> > In 2013, commit 72403b4a0fbd ("mm: numa: return the number of base pages
> > altered by protection changes") introduced "numa_huge_pte_updates" vmstat
> > entry, trying to capture how many huge ptes (in reality, PMD thps at that
> > time) are marked by NUMA balancing.
> > 
> > This patch proposes to remove it for some reasons.
> > 
> > Firstly, the name is misleading. We can have more than one way to have a
> > "huge pte" at least nowadays, and that's also the major goal of this patch,
> > where it paves way for PUD handling in change protection code paths.
> > 
> > PUDs are coming not only for dax (which has already came and yet broken..),
> > but also for pfnmaps and hugetlb pages.  The name will simply stop making
> > sense when PUD will start to be involved in mprotect() world.
> > 
> > It'll also make it not reasonable either if we boost the counter for both
> > pmd/puds.  In short, current accounting won't be right when PUD comes, so
> > the scheme was only suitable at that point in time where PUD wasn't even
> > possible.
> > 
> > Secondly, the accounting was simply not right from the start as long as it
> > was also affected by other call sites besides NUMA.  mprotect() is one,
> > while userfaultfd-wp also leverages change protection path to modify
> > pgtables.  If it wants to do right it needs to check the caller but it
> > never did; at least mprotect() should be there even in 2013.
> > 
> > It gives me the impression that nobody is seriously using this field, and
> > it's also impossible to be serious.
> 
> It's weird and the implementation is ugly. The intention really was to only
> consider MM_CP_PROT_NUMA, but that apparently is not the case.
> 
> hugetlb/mprotect/... should have never been accounted.
> 
> [...]
> 
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 73d791d1caad..53656227f70d 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1313,7 +1313,6 @@ const char * const vmstat_text[] = {
> >   #ifdef CONFIG_NUMA_BALANCING
> >   	"numa_pte_updates",
> > -	"numa_huge_pte_updates",
> >   	"numa_hint_faults",
> >   	"numa_hint_faults_local",
> >   	"numa_pages_migrated",
> 
> It's a user-visible update. I assume most tools should be prepared for this
> stat missing (just like handling !CONFIG_NUMA_BALANCING).
> 
> Apparently it's documented [1][2] for some distros:

Yes, and AFAIU, [2] is a document explaining an issue relevant to numa
balancing, and I'd highly suspect [2] referenced [1] here; even the
parameters are listed in the same order.

> 
> "The amount of transparent huge pages that were marked for NUMA hinting
> faults. In combination with numa_pte_updates the total address space that
> was marked can be calculated."
> 
> And now I realize that change_prot_numa() would account these PMD updates as
> well in numa_pte_updates and I am confused about the SUSE documentation: "In
> combination with numa_pte_updates" doesn't really apply, right?
> 
> At this point I don't know what's right or wrong.

Me neither, even without PUD involvement.

Talking about numa_pte_updates, hugetlb_change_protection() returns the
"number of huge ptes", so one 2M hugetlb page is accounted once; while for
the generic THP path (change_protection_range()) a huge pud is accounted as
HPAGE_PUD_NR.  It would make more sense to me if everything stuck to
PAGE_SIZE units.  So all these counters look a bit confusing.
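
Just to spell out the units (assuming 4K base pages; the macros below are
local stand-ins for the example, not the kernel definitions):

#include <stdio.h>

#define EX_PAGE_SIZE    4096UL
#define EX_PMD_SIZE     (2UL << 20)     /* 2M */
#define EX_PUD_SIZE     (1UL << 30)     /* 1G */

int main(void)
{
        /* what the generic THP path accounts per huge entry */
        printf("pages per pmd: %lu\n", EX_PMD_SIZE / EX_PAGE_SIZE);  /* 512 */
        printf("pages per pud: %lu\n", EX_PUD_SIZE / EX_PAGE_SIZE);  /* 262144 */
        /* while hugetlb_change_protection() accounts 1 per huge pte */
        return 0;
}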

> 
> If we'd want to fix it instead, the right thing to do would be doing the
> accounting only with MM_CP_PROT_NUMA. But then, numa_pte_updates is also
> wrongly updated I believe :(

Right.

I don't have a reason to change numa_pte_updates semantics yet so far, but
here there's the problem where numa_huge_pte_updates can be ambiguous when
there is even PUD involved.

In general, I don't know how I should treat this counter in PUD path even
if NUMA isn't involved in dax yet; it can be soon involved if we move on
with using this same path for hugetlb, or when 1G thp can be possible (with
Yu Zhao's TAO?).

One other thing I can do is I drop this patch, ignore NUMA_HUGE_PTE_UPDATES
in PUD dax processing for now.  It'll work for this series, but it'll still
be a problem later.  I figured maybe we should simply drop it from now.

Thanks,

> 
> 
> [1] https://documentation.suse.com/de-de/sles/12-SP5/html/SLES-all/cha-tuning-numactl.html
> [2] https://support.oracle.com/knowledge/Oracle%20Linux%20and%20Virtualization/2749259_1.html
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/8] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES
  2024-08-04 15:06     ` Peter Xu
@ 2024-08-06 13:02       ` David Hildenbrand
  2024-08-06 16:26         ` Peter Xu
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2024-08-06 13:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Dave Jiang, Rik van Riel, Dave Hansen,
	Michael Ellerman, linuxppc-dev, Matthew Wilcox, Rick P Edgecombe,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar, Alex Thorlton

> Right.
> 
> I don't have a reason to change numa_pte_updates semantics yet so far, but
> here there's the problem where numa_huge_pte_updates can be ambiguous when
> there is even PUD involved.
> 
> In general, I don't know how I should treat this counter in PUD path even
> if NUMA isn't involved in dax yet; it can be soon involved if we move on
> with using this same path for hugetlb, or when 1G thp can be possible (with
> Yu Zhao's TAO?).

We shouldn't bother about it in the PUD path at all I think. Especially 
as long as NUMA hinting doesn't apply to any of what we would handle on 
the PUD path :)

> 
> One other thing I can do is I drop this patch, ignore NUMA_HUGE_PTE_UPDATES
> in PUD dax processing for now.  It'll work for this series, but it'll still
> be a problem later.  I figured maybe we should simply drop it from now.

It probably shouldn't block your other fixes and we should likely 
discuss that separately.

I agree that we should look into dropping that PMD counter completely.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/8] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES
  2024-08-06 13:02       ` David Hildenbrand
@ 2024-08-06 16:26         ` Peter Xu
  2024-08-06 16:32           ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-08-06 16:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Dave Jiang, Rik van Riel, Dave Hansen,
	Michael Ellerman, linuxppc-dev, Matthew Wilcox, Rick P Edgecombe,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar, Alex Thorlton

On Tue, Aug 06, 2024 at 03:02:00PM +0200, David Hildenbrand wrote:
> > Right.
> > 
> > I don't have a reason to change numa_pte_updates semantics yet so far, but
> > here there's the problem where numa_huge_pte_updates can be ambiguous when
> > there is even PUD involved.
> > 
> > In general, I don't know how I should treat this counter in PUD path even
> > if NUMA isn't involved in dax yet; it can be soon involved if we move on
> > with using this same path for hugetlb, or when 1G thp can be possible (with
> > Yu Zhao's TAO?).
> 
> We shouldn't bother about it in the PUD path at all I think. Especially as
> long as NUMA hinting doesn't apply to any of what we would handle on the PUD
> path :)

Hmm, I just noticed that hugetlb was never involved.. but then how about a
potential 1G THP?  Do you mean 1G THPs will not be accounted in numa
balancing either, even in the future?

The motivation for having this patch in this series is that I want to be
clear on how I should treat this counter in the pud path if it doesn't go
away.  And when people compare the two paths, we'll need to be clear why
there's such a difference if I ignore it in the pud path.

Per my current read on this counter, it might be overkill to do that at
all, and it might be simpler if we drop it now.

> 
> > 
> > One other thing I can do is I drop this patch, ignore NUMA_HUGE_PTE_UPDATES
> > in PUD dax processing for now.  It'll work for this series, but it'll still
> > be a problem later.  I figured maybe we should simply drop it from now.
> 
> It probably shouldn't block your other fixes and we should likely discuss
> that separately.
> 
> I agree that we should look into dropping that PMD counter completely.

No strong opinion here.  If we prefer keeping that as a separate topic,
I'll drop this patch.  You're right, it's not yet relevant to the fix.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/8] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES
  2024-08-06 16:26         ` Peter Xu
@ 2024-08-06 16:32           ` David Hildenbrand
  2024-08-06 16:51             ` Peter Xu
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2024-08-06 16:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Dave Jiang, Rik van Riel, Dave Hansen,
	Michael Ellerman, linuxppc-dev, Matthew Wilcox, Rick P Edgecombe,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar, Alex Thorlton

On 06.08.24 18:26, Peter Xu wrote:
> On Tue, Aug 06, 2024 at 03:02:00PM +0200, David Hildenbrand wrote:
>>> Right.
>>>
>>> I don't have a reason to change numa_pte_updates semantics yet so far, but
>>> here there's the problem where numa_huge_pte_updates can be ambiguous when
>>> there is even PUD involved.
>>>
>>> In general, I don't know how I should treat this counter in PUD path even
>>> if NUMA isn't involved in dax yet; it can be soon involved if we move on
>>> with using this same path for hugetlb, or when 1G thp can be possible (with
>>> Yu Zhao's TAO?).
>>
>> We shouldn't bother about it in the PUD path at all I think. Especially as
>> long as NUMA hinting doesn't apply to any of what we would handle on the PUD
>> path :)
> 
> Hmm, I just noticed that hugetlb was never involved.. but then how about a
> potential 1G THP?  Do you mean 1G THPs will not be accounted in numa
> balancing either, even in the future?

My best guess is that you would want a separate counter for that. The 
old one was just badly named ...

72403b4a0fbd even spells out "NUMA huge PMD updates".


"NUMA huge PMD updates were the number of THP  updates which in 
combination can be used to calculate how many ptes were updated from 
userspace."

... which doesn't make sense if you don't know how "huge" the huge 
actually was. :)

> 
> The motivation for having this patch in this series is that I want to be
> clear on how I should treat this counter in the pud path if it doesn't go
> away.  And when people compare the two paths, we'll need to be clear why
> there's such a difference if I ignore it in the pud path.
> 
> Per my current read on this counter, it might be overkill to do that at
> all, and it might be simpler if we drop it now.

Fine with me. But I would send that out separately, not buried in this 
series. Then we might actually get Mel to review (was he CCed?).

Up to you!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/8] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES
  2024-08-06 16:32           ` David Hildenbrand
@ 2024-08-06 16:51             ` Peter Xu
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2024-08-06 16:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Dave Jiang, Rik van Riel, Dave Hansen,
	Michael Ellerman, linuxppc-dev, Matthew Wilcox, Rick P Edgecombe,
	Oscar Salvador, Mel Gorman, Andrew Morton, Borislav Petkov,
	Christophe Leroy, Huang Ying, Kirill A . Shutemov,
	Aneesh Kumar K . V, Dan Williams, Thomas Gleixner, Hugh Dickins,
	x86, Nicholas Piggin, Vlastimil Babka, Ingo Molnar, Alex Thorlton

On Tue, Aug 06, 2024 at 06:32:10PM +0200, David Hildenbrand wrote:
> On 06.08.24 18:26, Peter Xu wrote:
> > On Tue, Aug 06, 2024 at 03:02:00PM +0200, David Hildenbrand wrote:
> > > > Right.
> > > > 
> > > > I don't have a reason to change numa_pte_updates semantics yet so far, but
> > > > here there's the problem where numa_huge_pte_updates can be ambiguous when
> > > > there is even PUD involved.
> > > > 
> > > > In general, I don't know how I should treat this counter in PUD path even
> > > > if NUMA isn't involved in dax yet; it can be soon involved if we move on
> > > > with using this same path for hugetlb, or when 1G thp can be possible (with
> > > > Yu Zhao's TAO?).
> > > 
> > > We shouldn't bother about it in the PUD path at all I think. Especially as
> > > long as NUMA hinting doesn't apply to any of what we would handle on the PUD
> > > path :)
> > 
> > Hmm, I just noticed that hugetlb was never involved.. but then how about a
> > potential 1G THP?  Do you mean 1G THPs will not be accounted in numa
> > balancing either, even in the future?
> 
> My best guess is that you would want a separate counter for that. The old
> one was just badly named ...
> 
> 72403b4a0fbd even spells out "NUMA huge PMD updates".
> 
> 
> "NUMA huge PMD updates were the number of THP  updates which in combination
> can be used to calculate how many ptes were updated from userspace."
> 
> ... which doesn't make sense if you don't know how "huge" the huge actually
> was. :)
> 
> > 
> > The motivation for having this patch in this series is that I want to be
> > clear on how I should treat this counter in the pud path if it doesn't go
> > away.  And when people compare the two paths, we'll need to be clear why
> > there's such a difference if I ignore it in the pud path.
> > 
> > Per my current read on this counter, it might be overkill to do that at
> > all, and it might be simpler if we drop it now.
> 
> Fine with me. But I would send that out separately, not buried in this
> series. Then we might actually get Mel to review (was he CCed?).

Yes he is.

Fair point, let's do this separately.  It's just that when it's split out I
don't feel strongly about pushing that patch alone..  no reason for me to
push dropping a counter that maybe some people still use even if I don't.
More important to me is how I should move on with PUD; then at least this
is fully discussed, and ignoring it is an option I'm ok with.

I'll respin with this patch dropped for now, then I'll add a comment in the
PUD patch mentioning that this counter is ignored.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2024-08-06 16:51 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-15 19:21 [PATCH v3 0/8] mm/mprotect: Fix dax puds Peter Xu
2024-07-15 19:21 ` [PATCH v3 1/8] mm/dax: Dump start address in fault handler Peter Xu
2024-07-31 12:04   ` David Hildenbrand
2024-08-02 22:43     ` Peter Xu
2024-07-15 19:21 ` [PATCH v3 2/8] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES Peter Xu
2024-07-31 12:18   ` David Hildenbrand
2024-08-04 15:06     ` Peter Xu
2024-08-06 13:02       ` David Hildenbrand
2024-08-06 16:26         ` Peter Xu
2024-08-06 16:32           ` David Hildenbrand
2024-08-06 16:51             ` Peter Xu
2024-07-15 19:21 ` [PATCH v3 3/8] mm/mprotect: Push mmu notifier to PUDs Peter Xu
2024-07-15 19:21 ` [PATCH v3 4/8] mm/powerpc: Add missing pud helpers Peter Xu
2024-07-15 19:21 ` [PATCH v3 5/8] mm/x86: Make pud_leaf() only cares about PSE bit Peter Xu
2024-07-31 12:22   ` David Hildenbrand
2024-07-15 19:21 ` [PATCH v3 6/8] mm/x86: arch_check_zapped_pud() Peter Xu
2024-07-31 12:23   ` David Hildenbrand
2024-07-15 19:21 ` [PATCH v3 7/8] mm/x86: Add missing pud helpers Peter Xu
2024-07-15 19:21 ` [PATCH v3 8/8] mm/mprotect: fix dax pud handlings Peter Xu
2024-07-25 18:29   ` James Houghton
2024-07-25 22:41     ` Peter Xu
2024-07-26  0:23       ` James Houghton
2024-07-26 11:56         ` Peter Xu
2024-07-15 20:00 ` [PATCH v3 0/8] mm/mprotect: Fix dax puds Peter Xu
2024-07-24 15:15 ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).