linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH v4 0/7] mm/mprotect: Fix dax puds
@ 2024-08-07 19:48 Peter Xu
  2024-08-07 19:48 ` [PATCH v4 1/7] mm/dax: Dump start address in fault handler Peter Xu
                   ` (8 more replies)
  0 siblings, 9 replies; 28+ messages in thread
From: Peter Xu @ 2024-08-07 19:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: James Houghton, David Hildenbrand, Dave Hansen, peterx,
	Christophe Leroy, Dan Williams, Dave Jiang, Aneesh Kumar K . V,
	x86, Hugh Dickins, Matthew Wilcox, Ingo Molnar, Huang Ying,
	Rik van Riel, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Thomas Gleixner, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, Andrew Morton, Rick P Edgecombe,
	Mel Gorman

[Based on mm-unstable, commit 98808d08fc0f, Aug 7th]

v4:
- Added tags
- Dropped patch "mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES" [DavidH]
- Touched up comment in pgtable_split_needed() [James]

v1: https://lore.kernel.org/r/20240621142504.1940209-1-peterx@redhat.com
v2: https://lore.kernel.org/r/20240703212918.2417843-1-peterx@redhat.com
v3: https://lore.kernel.org/r/20240715192142.3241557-1-peterx@redhat.com

DAX has supported PUD pages for a while, but mprotect() on PUDs has been
missing since the start.  This series fixes that by providing PUD handling
in mprotect().  The longer-term goal is to add more types of PUD mappings,
like hugetlb or pfnmaps; this series paves the way by fixing the known PUD
entries first.

Considering nobody reported this until I looked at those other types of
PUD mappings, I'm thinking maybe it doesn't need to be a fix for stable
and may not need to be backported.  I would guess whoever cares about
mprotect() doesn't care about 1G dax PUDs yet, and vice versa.  I hope
fixing it only in new kernels is fine, but I'm open to suggestions.

There are a few small changes needed to teach mprotect() to work on PUDs.
E.g., it may need to start with dropping NUMA_HUGE_PTE_UPDATES, which may
stop making sense when there can be more than one type of huge pte.  OTOH,
we'll also need to push the mmu notifiers from the pmd to the pud layer,
which might need some attention, but so far I think it's safe.  For such
details, please refer to each patch's commit message.

The mprotect() pud path should be straightforward, as I kept it as simple
as possible.  There's no NUMA handling, as dax simply doesn't support
that.  There's also no userfault involvement, as file memory (even when it
works with userfault-wp async mode) will need to split a pud, so the pud
entry doesn't yet need to know about userfault's existence (hugetlb
entries will, but that's also for later).

Tests
=====

What I did test:

- cross-build tests that I normally cover [1]

- smoke tested on x86_64 with the simplest program [2] doing mprotect() on
  a dev_dax 1G PUD, using QEMU's nvdimm emulation [3] and ndctl to create
  namespaces with proper alignments.  This used to throw "bad pud" but now
  runs through fine.  I also checked that SIGBUS happens on illegal access
  to protected puds.  (A minimal sketch of such a program follows this
  list.)

- vmtests.
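
For reference, the smoke test boils down to something like the sketch
below; the real program is [2], and the device path plus the assumption of
an already-created, 1G-aligned devdax namespace are just for illustration:

/* Minimal sketch of the dev_dax 1G PUD mprotect() smoke test. */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define MAP_SIZE	(1UL << 30)		/* one 1G PUD */

static void sig_handler(int sig)
{
	/* Expected after mprotect(PROT_READ): the write below must trap. */
	printf("got signal %d writing to a protected dax pud\n", sig);
	exit(0);
}

int main(void)
{
	int fd = open("/dev/dax0.0", O_RDWR);	/* device name is an assumption */
	char *buf;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	buf = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	buf[0] = 1;				/* fault in the 1G PUD mapping */

	signal(SIGBUS, sig_handler);		/* SIGBUS observed on illegal access */
	signal(SIGSEGV, sig_handler);		/* catch either, to be safe */

	/* This used to hit "bad pud" in dmesg; now it must protect the PUD. */
	if (mprotect(buf, MAP_SIZE, PROT_READ))
		perror("mprotect");

	buf[0] = 2;				/* must trap, not succeed */
	printf("write after mprotect(PROT_READ) unexpectedly succeeded\n");
	return 1;
}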

What I didn't test:

- fsdax: I wanted to give it a shot too, but only then did I notice it
  doesn't seem to be supported (according to dax_iomap_fault(), which will
  always fall back on PUD_ORDER).  I did remember it being supported
  before, so I could be missing something important there.. please shoot
  if so.

- userfault wp-async: I also wanted to test userfault-wp async being able
  to split huge puds (here it's simply a clear_pud.. though), but it won't
  work for devdax anyway because faults smaller than 1G aren't allowed in
  this case.  So skipped too.

- Power, as there's no hardware on hand.

Thanks,

[1] https://gitlab.com/peterx/lkb-harness/-/blob/main/config.json
[2] https://github.com/xzpeter/clibs/blob/master/misc/dax.c
[3] https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt

Peter Xu (7):
  mm/dax: Dump start address in fault handler
  mm/mprotect: Push mmu notifier to PUDs
  mm/powerpc: Add missing pud helpers
  mm/x86: Make pud_leaf() only care about PSE bit
  mm/x86: arch_check_zapped_pud()
  mm/x86: Add missing pud helpers
  mm/mprotect: fix dax pud handlings

 arch/powerpc/include/asm/book3s/64/pgtable.h |  3 +
 arch/powerpc/mm/book3s64/pgtable.c           | 20 ++++++
 arch/x86/include/asm/pgtable.h               | 68 ++++++++++++++++---
 arch/x86/mm/pgtable.c                        | 19 ++++++
 drivers/dax/device.c                         |  6 +-
 include/linux/huge_mm.h                      | 24 +++++++
 include/linux/pgtable.h                      |  7 ++
 mm/huge_memory.c                             | 56 ++++++++++++++-
 mm/mprotect.c                                | 71 +++++++++++++-------
 9 files changed, 236 insertions(+), 38 deletions(-)

-- 
2.45.0


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v4 1/7] mm/dax: Dump start address in fault handler
  2024-08-07 19:48 [PATCH v4 0/7] mm/mprotect: Fix dax puds Peter Xu
@ 2024-08-07 19:48 ` Peter Xu
  2024-08-07 19:48 ` [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs Peter Xu
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Peter Xu @ 2024-08-07 19:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: James Houghton, David Hildenbrand, Dave Hansen, peterx,
	Christophe Leroy, Dan Williams, Dave Jiang, Aneesh Kumar K . V,
	x86, Hugh Dickins, Matthew Wilcox, Ingo Molnar, Huang Ying,
	Rik van Riel, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Thomas Gleixner, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, Andrew Morton, Rick P Edgecombe,
	Mel Gorman

Currently the dax fault handler dumps the vma range when dynamic debugging
is enabled.  That's mostly not useful.  Dump the (aligned) address with the
order info instead.
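
With this change, a write fault on a 1G dev_dax mapping logs something like
the line below (values are made up for illustration; the exact prefix
depends on the device name and dynamic debug settings):

  dax0.0: myprog: op=write addr=0x7f4a40000000 order=18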

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/dax/device.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 2051e4f73c8a..9c1a729cd77e 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -235,9 +235,9 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, unsigned int order)
 	int id;
 	struct dev_dax *dev_dax = filp->private_data;
 
-	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) order:%d\n", current->comm,
-			(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
-			vmf->vma->vm_start, vmf->vma->vm_end, order);
+	dev_dbg(&dev_dax->dev, "%s: op=%s addr=%#lx order=%d\n", current->comm,
+		(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
+		vmf->address & ~((1UL << (order + PAGE_SHIFT)) - 1), order);
 
 	id = dax_read_lock();
 	if (order == 0)
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs
  2024-08-07 19:48 [PATCH v4 0/7] mm/mprotect: Fix dax puds Peter Xu
  2024-08-07 19:48 ` [PATCH v4 1/7] mm/dax: Dump start address in fault handler Peter Xu
@ 2024-08-07 19:48 ` Peter Xu
  2024-08-08 15:33   ` Sean Christopherson
  2024-08-07 19:48 ` [PATCH v4 3/7] mm/powerpc: Add missing pud helpers Peter Xu
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2024-08-07 19:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: James Houghton, kvm, David Hildenbrand, Dave Hansen, peterx,
	Christophe Leroy, Dan Williams, David Rientjes, Dave Jiang,
	Aneesh Kumar K . V, x86, Hugh Dickins, Matthew Wilcox,
	Ingo Molnar, Huang Ying, Rik van Riel, Sean Christopherson,
	Nicholas Piggin, Borislav Petkov, Kirill A . Shutemov,
	Thomas Gleixner, Vlastimil Babka, Oscar Salvador, linuxppc-dev,
	Paolo Bonzini, Andrew Morton, Rick P Edgecombe, Mel Gorman

mprotect() invokes mmu notifiers at the PMD level.  It has been that way
since 2014, from commit a5338093bfb4 ("mm: move mmu notifier call from
change_protection to change_pmd_range").

At that time, the issue was that NUMA balancing can be applied to a huge
range of VM memory, even if nothing was populated.  The notification can be
avoided in this case if no valid pmd is detected, which includes either a
THP or a PTE pgtable page.

Now, to pave the way for PUD handling, this isn't enough.  We need to
generate mmu notifications properly on PUD entries too.  mprotect() is
currently broken on PUDs (e.g., one can easily trigger a kernel error with
dax 1G mappings already); this is the start of fixing it.

To fix that, this patch proposes to push such notifications to the PUD
layer.

There is a risk of regressing the problem Rik wanted to resolve before,
but I don't think it should really happen, and I still chose this solution
for a few reasons:

  1) Consider a large VM that should definitely contain more than GBs of
  memory; it's highly likely that the PUDs are also none.  In this case
  there will be no regression.

  2) KVM has evolved a lot over the years to get rid of rmap walks, which
  might have been the major cause of the previous soft-lockup.  At least
  the TDP MMU already got rid of rmap as long as it's not nested (which
  should be the major use case, IIUC); the TDP MMU pgtable walker will
  then simply see an empty VM pgtable (e.g. EPT on x86), so invalidating a
  fully empty region should in most cases be pretty fast now, compared to
  2014.

  3) KVM now has explicit code paths to give way to mmu notifiers just
  like this one, e.g. in commit d02c357e5bfa ("KVM: x86/mmu: Retry fault
  before acquiring mmu_lock if mapping is changing").  That will also
  avoid contention that may contribute to a soft-lockup.

  4) Sticking with the PMD layer simply doesn't work when a PUD is
  there...  We need one way or another to fix PUD mappings in mprotect().

Pushing it to the PUD layer should be the safest approach as of now, e.g.
there's no sign yet of huge P4Ds coming on any known arch.

Cc: kvm@vger.kernel.org
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mprotect.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 37cf8d249405..d423080e6509 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -363,9 +363,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 	unsigned long next;
 	long pages = 0;
 	unsigned long nr_huge_updates = 0;
-	struct mmu_notifier_range range;
-
-	range.start = 0;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -383,14 +380,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		if (pmd_none(*pmd))
 			goto next;
 
-		/* invoke the mmu notifier if the pmd is populated */
-		if (!range.start) {
-			mmu_notifier_range_init(&range,
-				MMU_NOTIFY_PROTECTION_VMA, 0,
-				vma->vm_mm, addr, end);
-			mmu_notifier_invalidate_range_start(&range);
-		}
-
 		_pmd = pmdp_get_lockless(pmd);
 		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
@@ -431,9 +420,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
 
-	if (range.start)
-		mmu_notifier_invalidate_range_end(&range);
-
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
 	return pages;
@@ -443,22 +429,36 @@ static inline long change_pud_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
+	struct mmu_notifier_range range;
 	pud_t *pud;
 	unsigned long next;
 	long pages = 0, ret;
 
+	range.start = 0;
+
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		ret = change_prepare(vma, pud, pmd, addr, cp_flags);
-		if (ret)
-			return ret;
+		if (ret) {
+			pages = ret;
+			break;
+		}
 		if (pud_none_or_clear_bad(pud))
 			continue;
+		if (!range.start) {
+			mmu_notifier_range_init(&range,
+						MMU_NOTIFY_PROTECTION_VMA, 0,
+						vma->vm_mm, addr, end);
+			mmu_notifier_invalidate_range_start(&range);
+		}
 		pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
 					  cp_flags);
 	} while (pud++, addr = next, addr != end);
 
+	if (range.start)
+		mmu_notifier_invalidate_range_end(&range);
+
 	return pages;
 }
 
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 3/7] mm/powerpc: Add missing pud helpers
  2024-08-07 19:48 [PATCH v4 0/7] mm/mprotect: Fix dax puds Peter Xu
  2024-08-07 19:48 ` [PATCH v4 1/7] mm/dax: Dump start address in fault handler Peter Xu
  2024-08-07 19:48 ` [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs Peter Xu
@ 2024-08-07 19:48 ` Peter Xu
  2024-08-07 19:48 ` [PATCH v4 4/7] mm/x86: Make pud_leaf() only care about PSE bit Peter Xu
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Peter Xu @ 2024-08-07 19:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: James Houghton, David Hildenbrand, Dave Hansen, peterx,
	Christophe Leroy, Dan Williams, Dave Jiang, Aneesh Kumar K . V,
	x86, Hugh Dickins, Matthew Wilcox, Ingo Molnar, Huang Ying,
	Rik van Riel, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Thomas Gleixner, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, Andrew Morton, Rick P Edgecombe,
	Mel Gorman

These new helpers will be needed for pud entry updates soon.  Introduce
them by referencing the pmd ones.  Namely:

- pudp_invalidate()
- pud_modify()

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |  3 +++
 arch/powerpc/mm/book3s64/pgtable.c           | 20 ++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 519b1743a0f4..5da92ba68a45 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1124,6 +1124,7 @@ extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
 extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot);
 extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
 extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
+extern pud_t pud_modify(pud_t pud, pgprot_t newprot);
 extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 		       pmd_t *pmdp, pmd_t pmd);
 extern void set_pud_at(struct mm_struct *mm, unsigned long addr,
@@ -1384,6 +1385,8 @@ static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm,
 #define __HAVE_ARCH_PMDP_INVALIDATE
 extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			     pmd_t *pmdp);
+extern pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+			     pud_t *pudp);
 
 #define pmd_move_must_withdraw pmd_move_must_withdraw
 struct spinlock;
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index f4d8d3c40e5c..5a4a75369043 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -176,6 +176,17 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 	return __pmd(old_pmd);
 }
 
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		      pud_t *pudp)
+{
+	unsigned long old_pud;
+
+	VM_WARN_ON_ONCE(!pud_present(*pudp));
+	old_pud = pud_hugepage_update(vma->vm_mm, address, pudp, _PAGE_PRESENT, _PAGE_INVALID);
+	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+	return __pud(old_pud);
+}
+
 pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
 				   unsigned long addr, pmd_t *pmdp, int full)
 {
@@ -259,6 +270,15 @@ pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	pmdv &= _HPAGE_CHG_MASK;
 	return pmd_set_protbits(__pmd(pmdv), newprot);
 }
+
+pud_t pud_modify(pud_t pud, pgprot_t newprot)
+{
+	unsigned long pudv;
+
+	pudv = pud_val(pud);
+	pudv &= _HPAGE_CHG_MASK;
+	return pud_set_protbits(__pud(pudv), newprot);
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /* For use by kexec, called with MMU off */
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 4/7] mm/x86: Make pud_leaf() only care about PSE bit
  2024-08-07 19:48 [PATCH v4 0/7] mm/mprotect: Fix dax puds Peter Xu
                   ` (2 preceding siblings ...)
  2024-08-07 19:48 ` [PATCH v4 3/7] mm/powerpc: Add missing pud helpers Peter Xu
@ 2024-08-07 19:48 ` Peter Xu
  2024-08-07 22:22   ` Thomas Gleixner
  2024-08-07 19:48 ` [PATCH v4 5/7] mm/x86: arch_check_zapped_pud() Peter Xu
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2024-08-07 19:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: James Houghton, David Hildenbrand, Dave Hansen, peterx,
	Christophe Leroy, Dan Williams, Dave Jiang, Aneesh Kumar K . V,
	x86, Hugh Dickins, Matthew Wilcox, Ingo Molnar, Huang Ying,
	Rik van Riel, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Thomas Gleixner, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, Andrew Morton, Rick P Edgecombe,
	Mel Gorman

An entry should be reported as a PUD leaf even if it's PROT_NONE, in which
case the PRESENT bit isn't set.  Without this I hit "bad pud" when testing
dax 1G and zapping a PROT_NONE PUD.
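
To illustrate with a userspace model (flag values assumed to mirror x86;
this is not the kernel code itself): a huge PUD made PROT_NONE keeps
_PAGE_PSE but trades _PAGE_PRESENT for _PAGE_PROTNONE, so the old check
stops treating it as a leaf:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define _PAGE_PRESENT	(1ULL << 0)
#define _PAGE_PSE	(1ULL << 7)
#define _PAGE_PROTNONE	(1ULL << 8)	/* aliases the global bit on x86 */

static bool pud_leaf_old(uint64_t pud)
{
	return (pud & (_PAGE_PSE | _PAGE_PRESENT)) == (_PAGE_PSE | _PAGE_PRESENT);
}

static bool pud_leaf_new(uint64_t pud)
{
	return pud & _PAGE_PSE;
}

int main(void)
{
	/* A huge PUD after mprotect(PROT_NONE): PSE kept, PRESENT cleared. */
	uint64_t prot_none_pud = _PAGE_PSE | _PAGE_PROTNONE;

	/* old=0 (not a leaf, hence "bad pud" on zap), new=1 (still a leaf) */
	printf("old=%d new=%d\n", pud_leaf_old(prot_none_pud),
	       pud_leaf_new(prot_none_pud));
	return 0;
}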

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e39311a89bf4..a2a3bd4c1bda 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1078,8 +1078,7 @@ static inline pmd_t *pud_pgtable(pud_t pud)
 #define pud_leaf pud_leaf
 static inline bool pud_leaf(pud_t pud)
 {
-	return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
-		(_PAGE_PSE | _PAGE_PRESENT);
+	return pud_val(pud) & _PAGE_PSE;
 }
 
 static inline int pud_bad(pud_t pud)
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 5/7] mm/x86: arch_check_zapped_pud()
  2024-08-07 19:48 [PATCH v4 0/7] mm/mprotect: Fix dax puds Peter Xu
                   ` (3 preceding siblings ...)
  2024-08-07 19:48 ` [PATCH v4 4/7] mm/x86: Make pud_leaf() only care about PSE bit Peter Xu
@ 2024-08-07 19:48 ` Peter Xu
  2024-08-07 22:28   ` Thomas Gleixner
  2024-08-07 19:48 ` [PATCH v4 6/7] mm/x86: Add missing pud helpers Peter Xu
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2024-08-07 19:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: James Houghton, David Hildenbrand, Dave Hansen, peterx,
	Christophe Leroy, Dan Williams, Dave Jiang, Aneesh Kumar K . V,
	x86, Hugh Dickins, Matthew Wilcox, Ingo Molnar, Huang Ying,
	Rik van Riel, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Thomas Gleixner, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, Andrew Morton, Rick P Edgecombe,
	Mel Gorman

Introduce arch_check_zapped_pud() to sanity check shadow stack on PUD zaps.
It has the same logic of the PMD helper.

One thing to mention: it might be a good idea to use page_table_check in
the future for trapping wrong setups of shadow stack pgtable entries [1].
That is left as a separate future effort.

[1] https://lore.kernel.org/all/59d518698f664e07c036a5098833d7b56b953305.camel@intel.com

Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h | 10 ++++++++++
 arch/x86/mm/pgtable.c          |  7 +++++++
 include/linux/pgtable.h        |  7 +++++++
 mm/huge_memory.c               |  4 +++-
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a2a3bd4c1bda..fdb8ac9e7030 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -174,6 +174,13 @@ static inline int pud_young(pud_t pud)
 	return pud_flags(pud) & _PAGE_ACCESSED;
 }
 
+static inline bool pud_shstk(pud_t pud)
+{
+	return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+	       (pud_flags(pud) & (_PAGE_RW | _PAGE_DIRTY | _PAGE_PSE)) ==
+	       (_PAGE_DIRTY | _PAGE_PSE);
+}
+
 static inline int pte_write(pte_t pte)
 {
 	/*
@@ -1667,6 +1674,9 @@ void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte);
 #define arch_check_zapped_pmd arch_check_zapped_pmd
 void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd);
 
+#define arch_check_zapped_pud arch_check_zapped_pud
+void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud);
+
 #ifdef CONFIG_XEN_PV
 #define arch_has_hw_nonleaf_pmd_young arch_has_hw_nonleaf_pmd_young
 static inline bool arch_has_hw_nonleaf_pmd_young(void)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f5931499c2d6..d4b3ccf90236 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -926,3 +926,10 @@ void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd)
 	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
 			pmd_shstk(pmd));
 }
+
+void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud)
+{
+	/* See note in arch_check_zapped_pte() */
+	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
+			pud_shstk(pud));
+}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2a6a3cccfc36..2289e9f7aa1b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -447,6 +447,13 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
 }
 #endif
 
+#ifndef arch_check_zapped_pud
+static inline void arch_check_zapped_pud(struct vm_area_struct *vma,
+					 pud_t pud)
+{
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0024266dea0a..81c5da0708ed 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2293,12 +2293,14 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pud_t *pud, unsigned long addr)
 {
 	spinlock_t *ptl;
+	pud_t orig_pud;
 
 	ptl = __pud_trans_huge_lock(pud, vma);
 	if (!ptl)
 		return 0;
 
-	pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
+	orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
+	arch_check_zapped_pud(vma, orig_pud);
 	tlb_remove_pud_tlb_entry(tlb, pud, addr);
 	if (vma_is_special_huge(vma)) {
 		spin_unlock(ptl);
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 6/7] mm/x86: Add missing pud helpers
  2024-08-07 19:48 [PATCH v4 0/7] mm/mprotect: Fix dax puds Peter Xu
                   ` (4 preceding siblings ...)
  2024-08-07 19:48 ` [PATCH v4 5/7] mm/x86: arch_check_zapped_pud() Peter Xu
@ 2024-08-07 19:48 ` Peter Xu
  2024-08-07 22:37   ` Thomas Gleixner
  2024-08-07 19:48 ` [PATCH v4 7/7] mm/mprotect: fix dax pud handlings Peter Xu
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2024-08-07 19:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: James Houghton, David Hildenbrand, Dave Hansen, peterx,
	Christophe Leroy, Dan Williams, Dave Jiang, Aneesh Kumar K . V,
	x86, Hugh Dickins, Matthew Wilcox, Ingo Molnar, Huang Ying,
	Rik van Riel, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Thomas Gleixner, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, Andrew Morton, Rick P Edgecombe,
	Mel Gorman

These new helpers will be needed for pud entry updates soon.  Introduce
these helpers by referencing the pmd ones.  Namely:

- pudp_invalidate()
- pud_modify()

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h | 55 +++++++++++++++++++++++++++++-----
 arch/x86/mm/pgtable.c          | 12 ++++++++
 2 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index fdb8ac9e7030..a7c1e9cfea41 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -787,6 +787,12 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
 		      __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
 }
 
+static inline pud_t pud_mkinvalid(pud_t pud)
+{
+	return pfn_pud(pud_pfn(pud),
+		       __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
+}
+
 static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
@@ -834,14 +840,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	pmd_result = __pmd(val);
 
 	/*
-	 * To avoid creating Write=0,Dirty=1 PMDs, pte_modify() needs to avoid:
-	 *  1. Marking Write=0 PMDs Dirty=1
-	 *  2. Marking Dirty=1 PMDs Write=0
-	 *
-	 * The first case cannot happen because the _PAGE_CHG_MASK will filter
-	 * out any Dirty bit passed in newprot. Handle the second case by
-	 * going through the mksaveddirty exercise. Only do this if the old
-	 * value was Write=1 to avoid doing this on Shadow Stack PTEs.
+	 * Avoid creating shadow stack PMD by accident.  See comment in
+	 * pte_modify().
 	 */
 	if (oldval & _PAGE_RW)
 		pmd_result = pmd_mksaveddirty(pmd_result);
@@ -851,6 +851,29 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	return pmd_result;
 }
 
+static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
+{
+	pudval_t val = pud_val(pud), oldval = val;
+	pud_t pud_result;
+
+	val &= _HPAGE_CHG_MASK;
+	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+	val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);
+
+	pud_result = __pud(val);
+
+	/*
+	 * Avoid creating shadow stack PUD by accident.  See comment in
+	 * pte_modify().
+	 */
+	if (oldval & _PAGE_RW)
+		pud_result = pud_mksaveddirty(pud_result);
+	else
+		pud_result = pud_clear_saveddirty(pud_result);
+
+	return pud_result;
+}
+
 /*
  * mprotect needs to preserve PAT and encryption bits when updating
  * vm_page_prot
@@ -1389,10 +1412,26 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 }
 #endif
 
+static inline pud_t pudp_establish(struct vm_area_struct *vma,
+		unsigned long address, pud_t *pudp, pud_t pud)
+{
+	page_table_check_pud_set(vma->vm_mm, pudp, pud);
+	if (IS_ENABLED(CONFIG_SMP)) {
+		return xchg(pudp, pud);
+	} else {
+		pud_t old = *pudp;
+		WRITE_ONCE(*pudp, pud);
+		return old;
+	}
+}
+
 #define __HAVE_ARCH_PMDP_INVALIDATE_AD
 extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
 				unsigned long address, pmd_t *pmdp);
 
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		      pud_t *pudp);
+
 /*
  * Page table pages are page-aligned.  The lower half of the top
  * level is used for userspace and the top half for the kernel.
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index d4b3ccf90236..9fc2dabf8427 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -641,6 +641,18 @@ pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		     pud_t *pudp)
+{
+	VM_WARN_ON_ONCE(!pud_present(*pudp));
+	pud_t old = pudp_establish(vma, address, pudp, pud_mkinvalid(*pudp));
+	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+	return old;
+}
+#endif
+
 /**
  * reserve_top_address - reserves a hole in the top of kernel address space
  * @reserve - size of hole to reserve
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v4 7/7] mm/mprotect: fix dax pud handlings
  2024-08-07 19:48 [PATCH v4 0/7] mm/mprotect: Fix dax puds Peter Xu
                   ` (5 preceding siblings ...)
  2024-08-07 19:48 ` [PATCH v4 6/7] mm/x86: Add missing pud helpers Peter Xu
@ 2024-08-07 19:48 ` Peter Xu
  2024-08-07 21:17 ` [PATCH v4 0/7] mm/mprotect: Fix dax puds Andrew Morton
  2024-08-07 21:23 ` Andrew Morton
  8 siblings, 0 replies; 28+ messages in thread
From: Peter Xu @ 2024-08-07 19:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: James Houghton, David Hildenbrand, Dave Hansen, peterx,
	Christophe Leroy, Dan Williams, Dave Jiang, Aneesh Kumar K . V,
	x86, Hugh Dickins, Matthew Wilcox, Ingo Molnar, Huang Ying,
	Rik van Riel, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Thomas Gleixner, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, Andrew Morton, Rick P Edgecombe,
	Mel Gorman

This is only relevant to the two archs that support PUD dax, namely x86_64
and ppc64.  PUD THPs do not yet exist elsewhere, and hugetlb PUDs do not
count in this case.

DAX has had PUD mappings for years, but the change protection path never
worked.  When the path is triggered in any form (a simple test would be
calling mprotect() on a 1G dev_dax mapping), the kernel will report "bad
pud".  This patch should fix that.

The new change_huge_pud() tries to keep everything simple.  For example,
it doesn't optimize the write bit, as that would need even more PUD
helpers.  It's not too bad anyway to have one more write fault in the
worst case, once per 1G range; it would matter more if it were per
PAGE_SIZE, though.  Neither does it support userfault-wp bits, as there
are no such PUD mappings supported; file mappings always need a split
there.

The same goes for TLB shootdown: the pmd path (which was for x86 only) has
the trick of using the _ad() version of pmdp_invalidate*(), which can
avoid one redundant TLB flush, but let's also leave that for later.
Again, the larger the mapping, the smaller such an effect.

There's some difference in handling "retry" for change_huge_pud() (where
it can return 0): it isn't like change_huge_pmd(), as the pmd version is
safe with all conditions handled in change_pte_range() later, thanks to
Hugh's new pte_offset_map_lock().  In short, change_pte_range() is simply
smarter.  Because of that, change_pud_range() needs a proper retry if it
races with something else and a huge PUD changed from under us.

The last thing to mention is that the PUD path currently ignores the huge
pte numa counter (NUMA_HUGE_PTE_UPDATES), not only because DAX is not
applicable to NUMA, but also because it's ambiguous on its own how to
account a pud in this case.  In an earlier version of this patchset I
proposed removing the counter, as the accounting doesn't even look right
as of now [1], but further discussion [2] suggested we can leave that for
later, as it doesn't block this series if we choose to ignore the counter.
That's what this patch does, by ignoring it.

While at it, touch up the comment in pgtable_split_needed() to make it
generic to either pmd or pud file THPs.

[1] https://lore.kernel.org/all/20240715192142.3241557-3-peterx@redhat.com/
[2] https://lore.kernel.org/r/added2d0-b8be-4108-82ca-1367a388d0b1@redhat.com

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: x86@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h | 24 +++++++++++++++++++
 mm/huge_memory.c        | 52 +++++++++++++++++++++++++++++++++++++++++
 mm/mprotect.c           | 39 ++++++++++++++++++++++++-------
 3 files changed, 107 insertions(+), 8 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ce44caa40eed..6370026689e0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -342,6 +342,17 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long address);
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
+		    unsigned long cp_flags);
+#else
+static inline int
+change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		pud_t *pudp, unsigned long addr, pgprot_t newprot,
+		unsigned long cp_flags) { return 0; }
+#endif
+
 #define split_huge_pud(__vma, __pud, __address)				\
 	do {								\
 		pud_t *____pud = (__pud);				\
@@ -585,6 +596,19 @@ static inline int next_order(unsigned long *orders, int prev)
 {
 	return 0;
 }
+
+static inline void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
+				    unsigned long address)
+{
+}
+
+static inline int change_huge_pud(struct mmu_gather *tlb,
+				  struct vm_area_struct *vma, pud_t *pudp,
+				  unsigned long addr, pgprot_t newprot,
+				  unsigned long cp_flags)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline int split_folio_to_list_to_order(struct folio *folio,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 81c5da0708ed..0aafd26d7a53 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2114,6 +2114,53 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	return ret;
 }
 
+/*
+ * Returns:
+ *
+ * - 0: if pud leaf changed from under us
+ * - 1: if pud can be skipped
+ * - HPAGE_PUD_NR: if pud was successfully processed
+ */
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
+		    unsigned long cp_flags)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pud_t oldpud, entry;
+	spinlock_t *ptl;
+
+	tlb_change_page_size(tlb, HPAGE_PUD_SIZE);
+
+	/* NUMA balancing doesn't apply to dax */
+	if (cp_flags & MM_CP_PROT_NUMA)
+		return 1;
+
+	/*
+	 * Huge entries on userfault-wp only works with anonymous, while we
+	 * don't have anonymous PUDs yet.
+	 */
+	if (WARN_ON_ONCE(cp_flags & MM_CP_UFFD_WP_ALL))
+		return 1;
+
+	ptl = __pud_trans_huge_lock(pudp, vma);
+	if (!ptl)
+		return 0;
+
+	/*
+	 * Can't clear PUD or it can race with concurrent zapping.  See
+	 * change_huge_pmd().
+	 */
+	oldpud = pudp_invalidate(vma, addr, pudp);
+	entry = pud_modify(oldpud, newprot);
+	set_pud_at(mm, addr, pudp, entry);
+	tlb_flush_pud_range(tlb, addr, HPAGE_PUD_SIZE);
+
+	spin_unlock(ptl);
+	return HPAGE_PUD_NR;
+}
+#endif
+
 #ifdef CONFIG_USERFAULTFD
 /*
  * The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by
@@ -2344,6 +2391,11 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
 }
+#else
+void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
+		unsigned long address)
+{
+}
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index d423080e6509..446f8e5f10d9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -302,8 +302,9 @@ pgtable_split_needed(struct vm_area_struct *vma, unsigned long cp_flags)
 {
 	/*
 	 * pte markers only resides in pte level, if we need pte markers,
-	 * we need to split.  We cannot wr-protect shmem thp because file
-	 * thp is handled differently when split by erasing the pmd so far.
+	 * we need to split.  For example, we cannot wr-protect a file thp
+	 * (e.g. 2M shmem) because file thp is handled differently when
+	 * split by erasing the pmd so far.
 	 */
 	return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma);
 }
@@ -430,31 +431,53 @@ static inline long change_pud_range(struct mmu_gather *tlb,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	struct mmu_notifier_range range;
-	pud_t *pud;
+	pud_t *pudp, pud;
 	unsigned long next;
 	long pages = 0, ret;
 
 	range.start = 0;
 
-	pud = pud_offset(p4d, addr);
+	pudp = pud_offset(p4d, addr);
 	do {
+again:
 		next = pud_addr_end(addr, end);
-		ret = change_prepare(vma, pud, pmd, addr, cp_flags);
+		ret = change_prepare(vma, pudp, pmd, addr, cp_flags);
 		if (ret) {
 			pages = ret;
 			break;
 		}
-		if (pud_none_or_clear_bad(pud))
+
+		pud = READ_ONCE(*pudp);
+		if (pud_none(pud))
 			continue;
+
 		if (!range.start) {
 			mmu_notifier_range_init(&range,
 						MMU_NOTIFY_PROTECTION_VMA, 0,
 						vma->vm_mm, addr, end);
 			mmu_notifier_invalidate_range_start(&range);
 		}
-		pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
+
+		if (pud_leaf(pud)) {
+			if ((next - addr != PUD_SIZE) ||
+			    pgtable_split_needed(vma, cp_flags)) {
+				__split_huge_pud(vma, pudp, addr);
+				goto again;
+			} else {
+				ret = change_huge_pud(tlb, vma, pudp,
+						      addr, newprot, cp_flags);
+				if (ret == 0)
+					goto again;
+				/* huge pud was handled */
+				if (ret == HPAGE_PUD_NR)
+					pages += HPAGE_PUD_NR;
+				continue;
+			}
+		}
+
+		pages += change_pmd_range(tlb, vma, pudp, addr, next, newprot,
 					  cp_flags);
-	} while (pud++, addr = next, addr != end);
+	} while (pudp++, addr = next, addr != end);
 
 	if (range.start)
 		mmu_notifier_invalidate_range_end(&range);
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 0/7] mm/mprotect: Fix dax puds
  2024-08-07 19:48 [PATCH v4 0/7] mm/mprotect: Fix dax puds Peter Xu
                   ` (6 preceding siblings ...)
  2024-08-07 19:48 ` [PATCH v4 7/7] mm/mprotect: fix dax pud handlings Peter Xu
@ 2024-08-07 21:17 ` Andrew Morton
  2024-08-07 21:34   ` Peter Xu
  2024-08-07 21:23 ` Andrew Morton
  8 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2024-08-07 21:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: James Houghton, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Thomas Gleixner, Dave Jiang, x86, Hugh Dickins,
	Matthew Wilcox, Ingo Molnar, Huang Ying, Rik van Riel,
	Nicholas Piggin, Borislav Petkov, Kirill A . Shutemov,
	Dan Williams, Vlastimil Babka, Oscar Salvador, linuxppc-dev,
	linux-kernel, Aneesh Kumar K . V, Rick P Edgecombe, Mel Gorman

On Wed,  7 Aug 2024 15:48:04 -0400 Peter Xu <peterx@redhat.com> wrote:

> 
> Dax supports pud pages for a while, but mprotect on puds was missing since
> the start.  This series tries to fix that by providing pud handling in
> mprotect().  The goal is to add more types of pud mappings like hugetlb or
> pfnmaps.  This series paves way for it by fixing known pud entries.
> 
> Considering nobody reported this until when I looked at those other types
> of pud mappings, I am thinking maybe it doesn't need to be a fix for stable
> and this may not need to be backported.  I would guess whoever cares about
> mprotect() won't care 1G dax puds yet, vice versa.  I hope fixing that in
> new kernels would be fine, but I'm open to suggestions.

Yes, I'm not sure this is a "fix" at all.  We're implementing something
which previously wasn't there.  Perhaps the entire series should be
called "mm: implement mprotect() for DAX PUDs"?



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 0/7] mm/mprotect: Fix dax puds
  2024-08-07 19:48 [PATCH v4 0/7] mm/mprotect: Fix dax puds Peter Xu
                   ` (7 preceding siblings ...)
  2024-08-07 21:17 ` [PATCH v4 0/7] mm/mprotect: Fix dax puds Andrew Morton
@ 2024-08-07 21:23 ` Andrew Morton
  2024-08-07 21:47   ` Peter Xu
  8 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2024-08-07 21:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: James Houghton, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Thomas Gleixner, Dave Jiang, x86, Hugh Dickins,
	Matthew Wilcox, Ingo Molnar, Huang Ying, Rik van Riel,
	Nicholas Piggin, Borislav Petkov, Kirill A . Shutemov,
	Dan Williams, Vlastimil Babka, Oscar Salvador, linuxppc-dev,
	linux-kernel, Aneesh Kumar K . V, Rick P Edgecombe, Mel Gorman

On Wed,  7 Aug 2024 15:48:04 -0400 Peter Xu <peterx@redhat.com> wrote:

> 
> Tests
> =====
> 
> What I did test:
> 
> - cross-build tests that I normally cover [1]
> 
> - smoke tested on x86_64 the simplest program [2] on dev_dax 1G PUD
>   mprotect() using QEMU's nvdimm emulations [3] and ndctl to create
>   namespaces with proper alignments, which used to throw "bad pud" but now
>   it'll run through all fine.  I checked sigbus happens if with illegal
>   access on protected puds.
> 
> - vmtests.
> 
> What I didn't test:
> 
> - fsdax: I wanted to also give it a shot, but only until then I noticed it
>   doesn't seem to be supported (according to dax_iomap_fault(), which will
>   always fallback on PUD_ORDER).  I did remember it was supported before, I
>   could miss something important there.. please shoot if so.

OK.  Who are you addressing this question to?

> - userfault wp-async: I also wanted to test userfault-wp async be able to
>   split huge puds (here it's simply a clear_pud.. though), but it won't
>   work for devdax anyway due to not allowed to do smaller than 1G faults in
>   this case. So skip too.

Sounds OK.  So that's an additional project if anyone cares enough?

> - Power, as no hardware on hand.

Hopefully the powerpc people can help with that.  What tests do you ask
that they run?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 0/7] mm/mprotect: Fix dax puds
  2024-08-07 21:17 ` [PATCH v4 0/7] mm/mprotect: Fix dax puds Andrew Morton
@ 2024-08-07 21:34   ` Peter Xu
  2024-08-07 21:44     ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2024-08-07 21:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Houghton, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Thomas Gleixner, Dave Jiang, x86, Hugh Dickins,
	Matthew Wilcox, Ingo Molnar, Huang Ying, Rik van Riel,
	Nicholas Piggin, Borislav Petkov, Kirill A . Shutemov,
	Dan Williams, Vlastimil Babka, Oscar Salvador, linuxppc-dev,
	linux-kernel, Aneesh Kumar K . V, Rick P Edgecombe, Mel Gorman

On Wed, Aug 07, 2024 at 02:17:03PM -0700, Andrew Morton wrote:
> On Wed,  7 Aug 2024 15:48:04 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > 
> > Dax supports pud pages for a while, but mprotect on puds was missing since
> > the start.  This series tries to fix that by providing pud handling in
> > mprotect().  The goal is to add more types of pud mappings like hugetlb or
> > pfnmaps.  This series paves way for it by fixing known pud entries.
> > 
> > Considering nobody reported this until when I looked at those other types
> > of pud mappings, I am thinking maybe it doesn't need to be a fix for stable
> > and this may not need to be backported.  I would guess whoever cares about
> > mprotect() won't care 1G dax puds yet, vice versa.  I hope fixing that in
> > new kernels would be fine, but I'm open to suggestions.
> 
> Yes, I'm not sure this is a "fix" at all.  We're implementing something
> which previously wasn't there.  Perhaps the entire series should be
> called "mm: implement mprotect() for DAX PUDs"?

The problem is mprotect() will skip the dax 1G PUD while it shouldn't;
meanwhile it'll dump some bad PUD in dmesg.  Both of them look like (corner
case) bugs to me.. where:

  - skipping the 1G pud means mprotect() will succeed even though the pud
    won't be updated with the correct permission specified.  Logically
    that can cause data corruption, e.g., mprotect(RO) followed by a write
    to the page, as the pud page will still be writable.

  - the bad pud will generate a pr_err() into dmesg, with no limit as far
    as I can see.  So I think it means a userspace can DoS the kernel log
    if it wants.. simply by creating the PUD and repeatedly mprotect-ing
    it.

But yeah this series fixes this "bug" by implementing that part..

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 0/7] mm/mprotect: Fix dax puds
  2024-08-07 21:34   ` Peter Xu
@ 2024-08-07 21:44     ` Andrew Morton
  2024-08-08 14:34       ` Peter Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2024-08-07 21:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: James Houghton, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Thomas Gleixner, Dave Jiang, x86, Hugh Dickins,
	Matthew Wilcox, Ingo Molnar, Huang Ying, Rik van Riel,
	Nicholas Piggin, Borislav Petkov, Kirill A . Shutemov,
	Dan Williams, Vlastimil Babka, Oscar Salvador, linuxppc-dev,
	linux-kernel, Aneesh Kumar K . V, Rick P Edgecombe, Mel Gorman

On Wed, 7 Aug 2024 17:34:10 -0400 Peter Xu <peterx@redhat.com> wrote:

> The problem is mprotect() will skip the dax 1G PUD while it shouldn't;
> meanwhile it'll dump some bad PUD in dmesg.  Both of them look like (corner
> case) bugs to me.. where:
> 
>   - skipping the 1G pud means mprotect() will succeed even if the pud won't
>     be updated with the correct permission specified. Logically that can
>     cause e.g. in mprotect(RO) then write the page can cause data corrupt,
>     as the pud page will still be writable.
> 
>   - the bad pud will generate a pr_err() into dmesg, with no limit so far I
>     can see.  So I think it means an userspace can DoS the kernel log if it
>     wants.. simply by creating the PUD and keep mprotect-ing it
> 

I edited this important info into the [0/n] text, thanks.

So current kernels can be made to spew into the kernel logs?  That's
considered serious.  Can unprivileged userspace code do this?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 0/7] mm/mprotect: Fix dax puds
  2024-08-07 21:23 ` Andrew Morton
@ 2024-08-07 21:47   ` Peter Xu
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Xu @ 2024-08-07 21:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Houghton, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Thomas Gleixner, Dave Jiang, x86, Hugh Dickins,
	Matthew Wilcox, Ingo Molnar, Huang Ying, Rik van Riel,
	Nicholas Piggin, Borislav Petkov, Kirill A . Shutemov,
	Dan Williams, Vlastimil Babka, Oscar Salvador, linuxppc-dev,
	linux-kernel, Aneesh Kumar K . V, Rick P Edgecombe, Mel Gorman

On Wed, Aug 07, 2024 at 02:23:16PM -0700, Andrew Morton wrote:
> On Wed,  7 Aug 2024 15:48:04 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > 
> > Tests
> > =====
> > 
> > What I did test:
> > 
> > - cross-build tests that I normally cover [1]
> > 
> > - smoke tested on x86_64 the simplest program [2] on dev_dax 1G PUD
> >   mprotect() using QEMU's nvdimm emulations [3] and ndctl to create
> >   namespaces with proper alignments, which used to throw "bad pud" but now
> >   it'll run through all fine.  I checked sigbus happens if with illegal
> >   access on protected puds.
> > 
> > - vmtests.
> > 
> > What I didn't test:
> > 
> > - fsdax: I wanted to also give it a shot, but only until then I noticed it
> >   doesn't seem to be supported (according to dax_iomap_fault(), which will
> >   always fallback on PUD_ORDER).  I did remember it was supported before, I
> >   could miss something important there.. please shoot if so.
> 
> OK.  Who are you addressing this question to?

Anyone who is familiar with fsdax + 1G.  Maybe Matthew would be the most
suitable, but I didn't track fsdax any further.

> 
> > - userfault wp-async: I also wanted to test userfault-wp async be able to
> >   split huge puds (here it's simply a clear_pud.. though), but it won't
> >   work for devdax anyway due to not allowed to do smaller than 1G faults in
> >   this case. So skip too.
> 
> Sounds OK.  So that's an additional project if anyone cares enough?

Right.

> 
> > - Power, as no hardware on hand.
> 
> Hopefully the powerpc people can help with that.  What tests do you ask
> that they run?

The test program [2] in the cover letter should work as a very basic test;
one needs to set up the dax device to use 1G mappings first, though:

[2] https://github.com/xzpeter/clibs/blob/master/misc/dax.c

At least in my experience there isn't much fancy we can do there; e.g., I
think dev_dax has a limitation on vma splits in that they must be 1G
aligned when using 1G mappings, so even a split can't happen (iirc I used
to try some random mprotect() on smaller ranges)..

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 4/7] mm/x86: Make pud_leaf() only care about PSE bit
  2024-08-07 19:48 ` [PATCH v4 4/7] mm/x86: Make pud_leaf() only care about PSE bit Peter Xu
@ 2024-08-07 22:22   ` Thomas Gleixner
  2024-08-08 14:54     ` Peter Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2024-08-07 22:22 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: James Houghton, David Hildenbrand, Dave Hansen, peterx,
	Christophe Leroy, Dave Jiang, Aneesh Kumar K . V, x86,
	Hugh Dickins, Matthew Wilcox, Ingo Molnar, Huang Ying,
	Rik van Riel, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Dan Williams, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, Andrew Morton, Rick P Edgecombe,
	Mel Gorman

On Wed, Aug 07 2024 at 15:48, Peter Xu wrote:
> An entry should be reported as PUD leaf even if it's PROT_NONE, in which
> case PRESENT bit isn't there. I hit bad pud without this when testing dax
> 1G on zapping a PROT_NONE PUD.

That does not qualify as a change log. What you hit is irrelevant unless
you explain the actual underlying problem. See Documentation/process/
including the TIP documentation.

> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index e39311a89bf4..a2a3bd4c1bda 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1078,8 +1078,7 @@ static inline pmd_t *pud_pgtable(pud_t pud)
>  #define pud_leaf pud_leaf
>  static inline bool pud_leaf(pud_t pud)
>  {
> -	return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
> -		(_PAGE_PSE | _PAGE_PRESENT);
> +	return pud_val(pud) & _PAGE_PSE;
>  }

And the changelog does not explain why this change is not affecting any
existing user of pud_leaf().

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 5/7] mm/x86: arch_check_zapped_pud()
  2024-08-07 19:48 ` [PATCH v4 5/7] mm/x86: arch_check_zapped_pud() Peter Xu
@ 2024-08-07 22:28   ` Thomas Gleixner
  2024-08-08 15:49     ` Peter Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2024-08-07 22:28 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: James Houghton, David Hildenbrand, Dave Hansen, peterx,
	Christophe Leroy, Dave Jiang, Aneesh Kumar K . V, x86,
	Hugh Dickins, Matthew Wilcox, Ingo Molnar, Huang Ying,
	Rik van Riel, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Dan Williams, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, Andrew Morton, Rick P Edgecombe,
	Mel Gorman

On Wed, Aug 07 2024 at 15:48, Peter Xu wrote:

> Subject: mm/x86: arch_check_zapped_pud()

Is not a proper subject line. It clearly lacks a verb.

  Subject: mm/x86: Implement arch_check_zapped_pud()


> Introduce arch_check_zapped_pud() to sanity check shadow stack on PUD zaps.
> It has the same logic of the PMD helper.

s/of/as/

> +
> +void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud)
> +{
> +	/* See note in arch_check_zapped_pte() */
> +	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
> +			pud_shstk(pud));

Please get rid of the line break. You have 100 characters.

> +}
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 2a6a3cccfc36..2289e9f7aa1b 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -447,6 +447,13 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
>  }
>  #endif
>  
> +#ifndef arch_check_zapped_pud
> +static inline void arch_check_zapped_pud(struct vm_area_struct *vma,
> +					 pud_t pud)
> +{

Ditto..

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 0024266dea0a..81c5da0708ed 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c

Why is a mm change buried in a patch which is named mm/x86?

It's clearly documented that core changes with the generic fallback come
in one patch and the architecture override in a separate one afterwards.

Do we write documentation just for the sake of writing it?

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 6/7] mm/x86: Add missing pud helpers
  2024-08-07 19:48 ` [PATCH v4 6/7] mm/x86: Add missing pud helpers Peter Xu
@ 2024-08-07 22:37   ` Thomas Gleixner
  2024-08-08 20:25     ` Peter Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2024-08-07 22:37 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: James Houghton, David Hildenbrand, Dave Hansen, peterx,
	Christophe Leroy, Dave Jiang, Aneesh Kumar K . V, x86,
	Hugh Dickins, Matthew Wilcox, Ingo Molnar, Huang Ying,
	Rik van Riel, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Dan Williams, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, Andrew Morton, Rick P Edgecombe,
	Mel Gorman

On Wed, Aug 07 2024 at 15:48, Peter Xu wrote:
> These new helpers will be needed for pud entry updates soon.  Introduce
> these helpers by referencing the pmd ones.  Namely:
>
> - pudp_invalidate()
> - pud_modify()

Zero content about what these helpers do and why they are needed. That's
not how it works, really.

  
> +static inline pud_t pud_mkinvalid(pud_t pud)
> +{
> +	return pfn_pud(pud_pfn(pud),
> +		       __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));

100 characters...

> +}
> +
>  static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
>  
>  static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> @@ -834,14 +840,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>  	pmd_result = __pmd(val);
>  
>  	/*
> -	 * To avoid creating Write=0,Dirty=1 PMDs, pte_modify() needs to avoid:
> -	 *  1. Marking Write=0 PMDs Dirty=1
> -	 *  2. Marking Dirty=1 PMDs Write=0
> -	 *
> -	 * The first case cannot happen because the _PAGE_CHG_MASK will filter
> -	 * out any Dirty bit passed in newprot. Handle the second case by
> -	 * going through the mksaveddirty exercise. Only do this if the old
> -	 * value was Write=1 to avoid doing this on Shadow Stack PTEs.
> +	 * Avoid creating shadow stack PMD by accident.  See comment in
> +	 * pte_modify().

The changelog is utterly silent about this comment update.

>  	 */
>  	if (oldval & _PAGE_RW)
>  		pmd_result = pmd_mksaveddirty(pmd_result);
> @@ -851,6 +851,29 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>  	return pmd_result;
>  }
>  
> +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
> +{
> +	pudval_t val = pud_val(pud), oldval = val;
> +	pud_t pud_result;
> +
> +	val &= _HPAGE_CHG_MASK;
> +	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
> +	val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);
> +
> +	pud_result = __pud(val);
> +
> +	/*
> +	 * Avoid creating shadow stack PUD by accident.  See comment in
> +	 * pte_modify().
> +	 */
> +	if (oldval & _PAGE_RW)
> +		pud_result = pud_mksaveddirty(pud_result);
> +	else
> +		pud_result = pud_clear_saveddirty(pud_result);
> +
> +	return pud_result;
> +}
> +
>  /*
>   * mprotect needs to preserve PAT and encryption bits when updating
>   * vm_page_prot
> @@ -1389,10 +1412,26 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>  }
>  #endif
>  
> +static inline pud_t pudp_establish(struct vm_area_struct *vma,
> +		unsigned long address, pud_t *pudp, pud_t pud)

Random line break alignment.... See documentation.

> +{
> +	page_table_check_pud_set(vma->vm_mm, pudp, pud);
> +	if (IS_ENABLED(CONFIG_SMP)) {
> +		return xchg(pudp, pud);
> +	} else {
> +		pud_t old = *pudp;
> +		WRITE_ONCE(*pudp, pud);

Lacks a newline between variable declaration and code.

But seriously, why optimizing for !SMP? That's a pointless exercise and
a guarantee for bitrot.

> +		return old;
> +	}
> +}
> +
>  #define __HAVE_ARCH_PMDP_INVALIDATE_AD
>  extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
>  				unsigned long address, pmd_t *pmdp);
>  
> +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
> +		      pud_t *pudp);

While 'extern' is not required, please keep the file style consistent
and use the 100 characters...

> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -641,6 +641,18 @@ pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
>  }
>  #endif
>  
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> +	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
> +		     pud_t *pudp)
> +{
> +	VM_WARN_ON_ONCE(!pud_present(*pudp));
> +	pud_t old = pudp_establish(vma, address, pudp, pud_mkinvalid(*pudp));
> +	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
> +	return old;

Your keyboard clearly lacks a newline key ...

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 0/7] mm/mprotect: Fix dax puds
  2024-08-07 21:44     ` Andrew Morton
@ 2024-08-08 14:34       ` Peter Xu
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Xu @ 2024-08-08 14:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Houghton, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Thomas Gleixner, Dave Jiang, x86, Hugh Dickins,
	Matthew Wilcox, Ingo Molnar, Huang Ying, Rik van Riel,
	Nicholas Piggin, Borislav Petkov, Kirill A . Shutemov,
	Dan Williams, Vlastimil Babka, Oscar Salvador, linuxppc-dev,
	linux-kernel, Aneesh Kumar K . V, Rick P Edgecombe, Mel Gorman

On Wed, Aug 07, 2024 at 02:44:54PM -0700, Andrew Morton wrote:
> On Wed, 7 Aug 2024 17:34:10 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > The problem is mprotect() will skip the dax 1G PUD while it shouldn't;
> > meanwhile it'll dump some bad PUD in dmesg.  Both of them look like (corner
> > case) bugs to me.. where:
> > 
> >   - skipping the 1G pud means mprotect() will report success even though
> >     the pud is never updated with the requested permission.  Logically
> >     that can cause data corruption, e.g., after mprotect(RO) a write to
> >     the page still succeeds because the pud remains writable.
> > 
> >   - the bad pud will generate a pr_err() into dmesg, with no rate limit
> >     as far as I can see.  So I think a userspace process can DoS the
> >     kernel log if it wants.. simply by creating the PUD and repeatedly
> >     mprotect-ing it
> > 
> 
> I edited this important info into the [0/n] text, thanks.
> 
> So current kernels can be made to spew into the kernel logs?  That's

I suppose yes to this one.

> considered serious.  Can unprivileged userspace code do this?

AFAIU, /dev/dax* requires root privilege by default, so it looks like the
answer is no.  But anyone more familiar with real-life dax usage, please
correct me otherwise.
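
For reference, the trigger is roughly like below (a rough sketch along the
lines of the test program mentioned in the cover letter, not the exact code;
the device path and the 1G size/alignment are assumptions that depend on how
the devdax namespace was created with ndctl):

/* Map a 1G-aligned devdax namespace, fault in the pud, then keep
 * mprotect()-ing it; without this series each call trips the
 * "bad pud" report in dmesg. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>

#define SZ	(1UL << 30)	/* one 1G pud */

int main(void)
{
	int fd = open("/dev/dax0.0", O_RDWR);	/* path is an assumption */
	char *buf;

	if (fd < 0)
		return 1;
	buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 1, SZ);			/* populate the 1G pud */
	for (int i = 0; i < 1000; i++) {
		mprotect(buf, SZ, PROT_READ);	/* dumps "bad pud" today */
		mprotect(buf, SZ, PROT_READ | PROT_WRITE);
	}
	return 0;
}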

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 4/7] mm/x86: Make pud_leaf() only care about PSE bit
  2024-08-07 22:22   ` Thomas Gleixner
@ 2024-08-08 14:54     ` Peter Xu
  2024-08-09 12:08       ` Thomas Gleixner
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2024-08-08 14:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: James Houghton, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Dave Jiang, x86, Hugh Dickins, Matthew Wilcox,
	Ingo Molnar, Huang Ying, Rik van Riel, Nicholas Piggin,
	Borislav Petkov, Kirill A . Shutemov, Dan Williams,
	Vlastimil Babka, Oscar Salvador, linuxppc-dev, linux-kernel,
	Aneesh Kumar K . V, Andrew Morton, Rick P Edgecombe, Mel Gorman

On Thu, Aug 08, 2024 at 12:22:38AM +0200, Thomas Gleixner wrote:
> On Wed, Aug 07 2024 at 15:48, Peter Xu wrote:
> > An entry should be reported as PUD leaf even if it's PROT_NONE, in which
> > case PRESENT bit isn't there. I hit bad pud without this when testing dax
> > 1G on zapping a PROT_NONE PUD.
> 
> That does not qualify as a change log. What you hit is irrelevant unless
> you explain the actual underlying problem. See Documentation/process/
> including the TIP documentation.

Firstly, thanks a lot for the reviews.

I thought the commit message explained exactly what the underlying problem
is, no?

The problem is that even if an entry is PROT_NONE, as long as the PSE bit is
set on the PUD it should be treated as a PUD leaf.  Currently, the code
returns pud_leaf()==false for those PROT_NONE PUD entries, and IMHO that is
wrong.  This patch wants to make it right.  I still think that's mostly what
I put in the commit message..
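
To illustrate with a rough sketch (it assumes the pud mprotect path added
later in this series, and it is not the exact kernel code; the function name
is made up):

static void sketch_pud_protnone(struct vm_area_struct *vma,
				unsigned long addr, pud_t *pudp)
{
	pud_t pud = *pudp;	/* _PAGE_PSE | _PAGE_PRESENT set */

	/*
	 * _HPAGE_CHG_MASK preserves _PAGE_PSE, while PAGE_NONE brings in
	 * _PAGE_PROTNONE but not _PAGE_PRESENT, so the PRESENT bit ends
	 * up cleared.
	 */
	pud = pud_modify(pud, PAGE_NONE);
	set_pud_at(vma->vm_mm, addr, pudp, pud);

	/*
	 * Old pud_leaf(): false (PRESENT missing), so a later zap falls
	 * through to the "bad pud" path.
	 * New pud_leaf(): true (PSE still set), so the entry is handled
	 * as the 1G leaf it still is.
	 */
}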

If not, would you please suggest something so I can try to make it better?
Or it would also help if you could point out which part of the two documents
I should reference.

> 
> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > index e39311a89bf4..a2a3bd4c1bda 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -1078,8 +1078,7 @@ static inline pmd_t *pud_pgtable(pud_t pud)
> >  #define pud_leaf pud_leaf
> >  static inline bool pud_leaf(pud_t pud)
> >  {
> > -	return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
> > -		(_PAGE_PSE | _PAGE_PRESENT);
> > +	return pud_val(pud) & _PAGE_PSE;
> >  }
> 
> And the changelog does not explain why this change is not affecting any
> existing user of pud_leaf().

That's what I want to do: I want to affect them..

And IMHO it was mostly fine before because mprotect() is broken with 1G
anyway, and I guess nobody managed to populate any pud entry with PROT_NONE
on dax 1G before; that's what this whole series is trying to fix.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs
  2024-08-07 19:48 ` [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs Peter Xu
@ 2024-08-08 15:33   ` Sean Christopherson
  2024-08-08 21:21     ` Peter Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Sean Christopherson @ 2024-08-08 15:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: James Houghton, kvm, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Dan Williams, Dave Jiang, x86, Hugh Dickins,
	Matthew Wilcox, Ingo Molnar, Huang Ying, Rik van Riel,
	David Rientjes, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Thomas Gleixner, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, linux-kernel, Aneesh Kumar K . V,
	Paolo Bonzini, Andrew Morton, Rick P Edgecombe, Mel Gorman

On Wed, Aug 07, 2024, Peter Xu wrote:
> mprotect() does mmu notifiers in PMD levels.  It's there since 2014 of
> commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to
> change_pmd_range").
> 
> At that time, the issue was that NUMA balancing can be applied on a huge
> range of VM memory, even if nothing was populated.  The notification can be
> avoided in this case if no valid pmd detected, which includes either THP or
> a PTE pgtable page.
> 
> Now to pave way for PUD handling, this isn't enough.  We need to generate
> mmu notifications even on PUD entries properly.  mprotect() is currently
> broken on PUD (e.g., one can easily trigger kernel error with dax 1G
> mappings already), this is the start to fix it.
> 
> To fix that, this patch proposes to push such notifications to the PUD
> layers.
> 
> There is risk on regressing the problem Rik wanted to resolve before, but I
> think it shouldn't really happen, and I still chose this solution because
> of a few reasons:
> 
>   1) Consider a large VM that should definitely contain more than GBs of
>   memory, it's highly likely that PUDs are also none.  In this case there

I don't follow this.  Did you mean to say it's highly likely that PUDs are *NOT*
none?

>   will have no regression.
> 
>   2) KVM has evolved a lot over the years to get rid of rmap walks, which
>   might be the major cause of the previous soft-lockup.  At least TDP MMU
>   already got rid of rmap as long as not nested (which should be the major
>   use case, IIUC), then the TDP MMU pgtable walker will simply see empty VM
>   pgtable (e.g. EPT on x86), the invalidation of a full empty region in
>   most cases could be pretty fast now, comparing to 2014.

The TDP MMU will indeed be a-ok.  It only zaps leaf SPTEs in response to
mmu_notifier invalidations, and checks NEED_RESCHED after processing each SPTE,
i.e. KVM won't zap an entire PUD and get stuck processing all its children.

I doubt the shadow MMU will fare much better than it did years ago though, AFAICT
the relevant code hasn't changed.  E.g. when zapping a large range in response to
an mmu_notifier invalidation, KVM never yields even if blocking is allowed.  That 
said, it is stupidly easy to fix the soft lockup problem in the shadow MMU.  KVM
already has an rmap walk path that plays nice with NEED_RESCHED *and* zaps rmaps,
but because of how things grew organically over the years, KVM never adopted the
cond_resched() logic for the mmu_notifier path.

As a bonus, now the .change_pte() is gone, the only other usage of x86's
kvm_handle_gfn_range() is for the aging mmu_notifiers, and I want to move those
to their own flow too[*], i.e. kvm_handle_gfn_range() in its current form can
be removed entirely.

I'll post a separate series, I don't think it needs to block this work, and I'm
fairly certain I can get this done for 6.12 (shouldn't be a large or scary series,
though I may tack on my lockless aging idea as an RFC).

https://lore.kernel.org/all/Zo137P7BFSxAutL2@google.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 5/7] mm/x86: arch_check_zapped_pud()
  2024-08-07 22:28   ` Thomas Gleixner
@ 2024-08-08 15:49     ` Peter Xu
  2024-08-08 20:45       ` David Hildenbrand
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2024-08-08 15:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: James Houghton, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Dave Jiang, x86, Hugh Dickins, Matthew Wilcox,
	Ingo Molnar, Huang Ying, Rik van Riel, Nicholas Piggin,
	Borislav Petkov, Kirill A . Shutemov, Dan Williams,
	Vlastimil Babka, Oscar Salvador, linuxppc-dev, linux-kernel,
	Aneesh Kumar K . V, Andrew Morton, Rick P Edgecombe, Mel Gorman

On Thu, Aug 08, 2024 at 12:28:47AM +0200, Thomas Gleixner wrote:
> On Wed, Aug 07 2024 at 15:48, Peter Xu wrote:
> 
> > Subject: mm/x86: arch_check_zapped_pud()
> 
> Is not a proper subject line. It clearly lacks a verb.
> 
>   Subject: mm/x86: Implement arch_check_zapped_pud()
> 
> 
> > Introduce arch_check_zapped_pud() to sanity check shadow stack on PUD zaps.
> > It has the same logic of the PMD helper.
> 
> s/of/as/

Will fix above two.

> 
> > +
> > +void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud)
> > +{
> > +	/* See note in arch_check_zapped_pte() */
> > +	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
> > +			pud_shstk(pud));
> 
> Please get rid of the line break. You have 100 characters.

Coding-style.rst still tells me 80 here:

        The preferred limit on the length of a single line is 80 columns.

        Statements longer than 80 columns should be broken into sensible chunks,
        unless exceeding 80 columns significantly increases readability and does
        not hide information.

Maybe this just changed very recently, so it's not even in mm-unstable yet?

I'll fix the two line-wraps in this patch anyway, as I figured these two
cases didn't even hit 80..  probably because I used fill-column=75 locally..

But I'll still probably need to figure it out for the other spots.  Please
help me justify.

> 
> > +}
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 2a6a3cccfc36..2289e9f7aa1b 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -447,6 +447,13 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
> >  }
> >  #endif
> >  
> > +#ifndef arch_check_zapped_pud
> > +static inline void arch_check_zapped_pud(struct vm_area_struct *vma,
> > +					 pud_t pud)
> > +{
> 
> Ditto..
> 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 0024266dea0a..81c5da0708ed 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> 
> Why is a mm change burried in a patch which is named mm/x86?
> 
> It's clearly documented that core changes with the generic fallback come
> in one patch and the architecture override in a separate one afterwards.
> 
> Do we write documentation just for the sake of writing it?

Hmm, if that's the rule, in practice I bet many of the patches we merged (and
whatever patches I used to read) violate it.. but I see now, I'll break this
patch into two.

Personally I'd really still prefer it to be one patch, especially since it's
only implemented for x86; then I copy the x86+mm maintainers all here and it
explains why the change is there.  So please let me know if you think it may
still make sense to keep this patch as-is; otherwise I'll split it by default.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 6/7] mm/x86: Add missing pud helpers
  2024-08-07 22:37   ` Thomas Gleixner
@ 2024-08-08 20:25     ` Peter Xu
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Xu @ 2024-08-08 20:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: James Houghton, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Dave Jiang, x86, Hugh Dickins, Matthew Wilcox,
	Ingo Molnar, Huang Ying, Rik van Riel, Nicholas Piggin,
	Borislav Petkov, Kirill A . Shutemov, Dan Williams,
	Vlastimil Babka, Oscar Salvador, linuxppc-dev, linux-kernel,
	Aneesh Kumar K . V, Andrew Morton, Rick P Edgecombe, Mel Gorman

On Thu, Aug 08, 2024 at 12:37:21AM +0200, Thomas Gleixner wrote:
> On Wed, Aug 07 2024 at 15:48, Peter Xu wrote:
> > These new helpers will be needed for pud entry updates soon.  Introduce
> > these helpers by referencing the pmd ones.  Namely:
> >
> > - pudp_invalidate()
> > - pud_modify()
> 
> Zero content about what these helpers do and why they are needed. That's
> not how it works, really.

I hope that the phrase Dave suggested adding, "by referencing the pmd ones",
would be helpful, because they are exactly referencing those.

But sure.. I rewrote the commit message as follows:

    mm/x86: Add missing pud helpers

    Some new helpers will be needed for pud entry updates soon.  Introduce
    these helpers by referencing the pmd ones.  Namely:

      - pudp_invalidate(): this helper invalidates a huge pud before a split
      happens, so that the invalidated pud entry makes sure no race can
      happen (either with software, like a concurrent zap, or with hardware,
      like a lost A/D bit).

      - pud_modify(): this helper applies a new pgprot to an existing huge pud
      mapping.

    For more information on why we need these two helpers, please refer to the
    corresponding pmd helpers in the mprotect() code path.

    While at it, simplify the pud_modify()/pmd_modify() comments on shadow stack
    pgtable entries to reference pte_modify() to avoid duplicating the whole
    paragraph three times.

Please let me know if this works for you.
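
For context, the expected caller on the mprotect() side would look roughly
like below -- a sketch modeled on change_huge_pmd(), not necessarily the
exact code that the last patch of this series ends up with (the function
name and return convention are illustrative):

static int change_huge_pud_sketch(struct vm_area_struct *vma, pud_t *pudp,
				  unsigned long addr, pgprot_t newprot)
{
	pud_t oldpud, entry;
	spinlock_t *ptl;

	/* Serialize against a concurrent split/zap of the huge pud */
	ptl = __pud_trans_huge_lock(pudp, vma);
	if (!ptl)
		return 0;

	/*
	 * Invalidate instead of clearing, so that a concurrent zap or a
	 * hardware A/D bit update cannot race with the modification below;
	 * pudp_invalidate() also flushes the TLB for the range.
	 */
	oldpud = pudp_invalidate(vma, addr, pudp);
	entry = pud_modify(oldpud, newprot);
	set_pud_at(vma->vm_mm, addr, pudp, entry);

	spin_unlock(ptl);
	return HPAGE_PUD_SIZE >> PAGE_SHIFT;	/* nr of pages updated */
}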

> 
>   
> > +static inline pud_t pud_mkinvalid(pud_t pud)
> > +{
> > +	return pfn_pud(pud_pfn(pud),
> > +		       __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
> 
> 100 characters...

I'm waiting for an answer on this one, so I'll ignore the "100 char" comments
for now, and will address them when I get a clue about what is happening..

> 
> > +}
> > +
> >  static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
> >  
> >  static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> > @@ -834,14 +840,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
> >  	pmd_result = __pmd(val);
> >  
> >  	/*
> > -	 * To avoid creating Write=0,Dirty=1 PMDs, pte_modify() needs to avoid:
> > -	 *  1. Marking Write=0 PMDs Dirty=1
> > -	 *  2. Marking Dirty=1 PMDs Write=0
> > -	 *
> > -	 * The first case cannot happen because the _PAGE_CHG_MASK will filter
> > -	 * out any Dirty bit passed in newprot. Handle the second case by
> > -	 * going through the mksaveddirty exercise. Only do this if the old
> > -	 * value was Write=1 to avoid doing this on Shadow Stack PTEs.
> > +	 * Avoid creating shadow stack PMD by accident.  See comment in
> > +	 * pte_modify().
> 
> The changelog is utterly silent about this comment update.

Updated my commit message altogether above; I hope it works.

> 
> >  	 */
> >  	if (oldval & _PAGE_RW)
> >  		pmd_result = pmd_mksaveddirty(pmd_result);
> > @@ -851,6 +851,29 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
> >  	return pmd_result;
> >  }
> >  
> > +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
> > +{
> > +	pudval_t val = pud_val(pud), oldval = val;
> > +	pud_t pud_result;
> > +
> > +	val &= _HPAGE_CHG_MASK;
> > +	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
> > +	val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);
> > +
> > +	pud_result = __pud(val);
> > +
> > +	/*
> > +	 * Avoid creating shadow stack PUD by accident.  See comment in
> > +	 * pte_modify().
> > +	 */
> > +	if (oldval & _PAGE_RW)
> > +		pud_result = pud_mksaveddirty(pud_result);
> > +	else
> > +		pud_result = pud_clear_saveddirty(pud_result);
> > +
> > +	return pud_result;
> > +}
> > +
> >  /*
> >   * mprotect needs to preserve PAT and encryption bits when updating
> >   * vm_page_prot
> > @@ -1389,10 +1412,26 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> >  }
> >  #endif
> >  
> > +static inline pud_t pudp_establish(struct vm_area_struct *vma,
> > +		unsigned long address, pud_t *pudp, pud_t pud)
> 
> Random line break alignment.... See documentation.

This is exactly what we do with pmdp_establish() right above.

Would you be fine if I keep this, just to make pmd/pud look the same?

> 
> > +{
> > +	page_table_check_pud_set(vma->vm_mm, pudp, pud);
> > +	if (IS_ENABLED(CONFIG_SMP)) {
> > +		return xchg(pudp, pud);
> > +	} else {
> > +		pud_t old = *pudp;
> > +		WRITE_ONCE(*pudp, pud);
> 
> Lacks a newline between variable declaration and code.
> 
> But seriously, why optimizing for !SMP? That's a pointless exercise and
> a guarantee for bitrot.

So far it still looks reasonable to me as long as it is there for pmd.  If I
remove it, people can complain again: "hey, we have this /optimization/ in
pmd, why did you remove it in pud?".  No end..

So again.. it's the same as the pmd ones.  Either we change nothing, or we
change both.  I don't want to expand this series beyond what it wants to
do.. would you mind if I keep it until someone justifies changing both?

> 
> > +		return old;
> > +	}
> > +}
> > +
> >  #define __HAVE_ARCH_PMDP_INVALIDATE_AD
> >  extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
> >  				unsigned long address, pmd_t *pmdp);
> >  
> > +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
> > +		      pud_t *pudp);
> 
> While 'extern' is not required, please keep the file style consistent
> and use the 100 characters...
> 
> > --- a/arch/x86/mm/pgtable.c
> > +++ b/arch/x86/mm/pgtable.c
> > @@ -641,6 +641,18 @@ pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
> >  }
> >  #endif
> >  
> > +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> > +	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> > +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
> > +		     pud_t *pudp)
> > +{
> > +	VM_WARN_ON_ONCE(!pud_present(*pudp));
> > +	pud_t old = pudp_establish(vma, address, pudp, pud_mkinvalid(*pudp));
> > +	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
> > +	return old;
> 
> Your keyboard clearly lacks a newline key ...

This is also the same: pmdp_invalidate() coded it exactly like that.

In general, I prefer keeping the pmd/pud helpers looking the same everywhere,
if you don't disagree for now.

I know that it might even be better to clean up everything, get the pmd
changes reviewed first, then clone them to pud.  That might indeed be
cleaner.  But it adds a lot of churn all over the place, and even with this
series I got stuck for months.. and so far I haven't yet posted what I
really wanted to post (huge pfnmaps).  I sincerely hope we can move forward
with this and then clean things up later (none of these are major issues;
all trivial details so far, IMHO, so I don't see anything going severely
wrong yet), and then each cleanup will need to be justified one by one on
its own spot.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 5/7] mm/x86: arch_check_zapped_pud()
  2024-08-08 15:49     ` Peter Xu
@ 2024-08-08 20:45       ` David Hildenbrand
  0 siblings, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2024-08-08 20:45 UTC (permalink / raw)
  To: Peter Xu, Thomas Gleixner
  Cc: James Houghton, Dave Hansen, linux-mm, Christophe Leroy,
	Dave Jiang, x86, Hugh Dickins, Matthew Wilcox, Ingo Molnar,
	Huang Ying, Rik van Riel, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Dan Williams, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, linux-kernel, Aneesh Kumar K . V,
	Andrew Morton, Rick P Edgecombe, Mel Gorman

>>> +void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud)
>>> +{
>>> +	/* See note in arch_check_zapped_pte() */
>>> +	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
>>> +			pud_shstk(pud));
>>
>> Please get rid of the line break. You have 100 characters.
> 
> Coding-style.rst still tells me 80 here:
> 
>          The preferred limit on the length of a single line is 80 columns.
> 
>          Statements longer than 80 columns should be broken into sensible chunks,
>          unless exceeding 80 columns significantly increases readability and does
>          not hide information.
> 
> Maybe this just changed very recently, so it's not even in mm-unstable yet?
> 
> I'll fix the two line-wraps in this patch anyway, as I figured these two
> cases didn't even hit 80..  probably because I used fill-column=75 locally..
> 
> But I'll still probably need to figure it out for the other spots.  Please
> help me justify.

My interpretation is (the doc is not completely clear to me either, but
checkpatch.pl hardcodes max_line_length=100) that we can happily use up to
100 chars.

I also tend to stay within 80 chars, unless going longer is really
reasonable. Years of Linux kernel hacking really taught my inner self not
to do it.

Here I would agree that having the VM_WARN_ON_ONCE on a single line would
aid readability.

An example where 100 chars are likely a bad idea would be when the nesting
is so deep that most lines start exceeding 80 chars. We should rather fix
the code then -- like the coding style spells out :)
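
FWIW, to double-check locally, something like the following should do (just
an illustration; flags from memory, so please verify):

	$ ./scripts/checkpatch.pl --strict -g HEAD              # warns past 100 cols
	$ ./scripts/checkpatch.pl --max-line-length=80 -g HEAD  # stricter 80-col check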

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs
  2024-08-08 15:33   ` Sean Christopherson
@ 2024-08-08 21:21     ` Peter Xu
  2024-08-08 21:31       ` Sean Christopherson
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2024-08-08 21:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: James Houghton, kvm, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Dan Williams, Dave Jiang, x86, Hugh Dickins,
	Matthew Wilcox, Ingo Molnar, Huang Ying, Rik van Riel,
	David Rientjes, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Thomas Gleixner, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, linux-kernel, Aneesh Kumar K . V,
	Paolo Bonzini, Andrew Morton, Rick P Edgecombe, Mel Gorman

Hi, Sean,

On Thu, Aug 08, 2024 at 08:33:59AM -0700, Sean Christopherson wrote:
> On Wed, Aug 07, 2024, Peter Xu wrote:
> > mprotect() does mmu notifiers in PMD levels.  It's there since 2014 of
> > commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to
> > change_pmd_range").
> > 
> > At that time, the issue was that NUMA balancing can be applied on a huge
> > range of VM memory, even if nothing was populated.  The notification can be
> > avoided in this case if no valid pmd detected, which includes either THP or
> > a PTE pgtable page.
> > 
> > Now to pave way for PUD handling, this isn't enough.  We need to generate
> > mmu notifications even on PUD entries properly.  mprotect() is currently
> > broken on PUD (e.g., one can easily trigger kernel error with dax 1G
> > mappings already), this is the start to fix it.
> > 
> > To fix that, this patch proposes to push such notifications to the PUD
> > layers.
> > 
> > There is risk on regressing the problem Rik wanted to resolve before, but I
> > think it shouldn't really happen, and I still chose this solution because
> > of a few reasons:
> > 
> >   1) Consider a large VM that should definitely contain more than GBs of
> >   memory, it's highly likely that PUDs are also none.  In this case there
> 
> I don't follow this.  Did you mean to say it's highly likely that PUDs are *NOT*
> none?

I did mean the original wordings.

Note that in the previous case Rik worked on, it's about a mostly empty VM
got NUMA hint applied.  So I did mean "PUDs are also none" here, with the
hope that when the numa hint applies on any part of the unpopulated guest
memory, it'll find nothing in PUDs. Here it's mostly not about a huge PUD
mapping as long as the guest memory is not backed by DAX (since only DAX
supports 1G huge pud so far, while hugetlb has its own path here in
mprotect, so it must be things like anon or shmem), but a PUD entry that
contains pmd pgtables.  For that part, I was trying to justify "no pmd
pgtable installed" with the fact that "a large VM that should definitely
contain more than GBs of memory", it means the PUD range should hopefully
never been accessed, so even the pmd pgtable entry should be missing.

With that, we should hopefully keep avoiding mmu notifications after this
patch, just like it used to be when done in pmd layers.

> 
> >   will have no regression.
> > 
> >   2) KVM has evolved a lot over the years to get rid of rmap walks, which
> >   might be the major cause of the previous soft-lockup.  At least TDP MMU
> >   already got rid of rmap as long as not nested (which should be the major
> >   use case, IIUC), then the TDP MMU pgtable walker will simply see empty VM
> >   pgtable (e.g. EPT on x86), the invalidation of a full empty region in
> >   most cases could be pretty fast now, comparing to 2014.
> 
> The TDP MMU will indeed be a-ok.  It only zaps leaf SPTEs in response to
> mmu_notifier invalidations, and checks NEED_RESCHED after processing each SPTE,
> i.e. KVM won't zap an entire PUD and get stuck processing all its children.
> 
> I doubt the shadow MMU will fare much better than it did years ago though, AFAICT
> the relevant code hasn't changed.  E.g. when zapping a large range in response to
> an mmu_notifier invalidation, KVM never yields even if blocking is allowed.  That 
> said, it is stupidly easy to fix the soft lockup problem in the shadow MMU.  KVM
> already has an rmap walk path that plays nice with NEED_RESCHED *and* zaps rmaps,
> but because of how things grew organically over the years, KVM never adopted the
> cond_resched() logic for the mmu_notifier path.
> 
> As a bonus, now the .change_pte() is gone, the only other usage of x86's
> kvm_handle_gfn_range() is for the aging mmu_notifiers, and I want to move those
> to their own flow too[*], i.e. kvm_handle_gfn_range() in its current form can
> be removed entirely.
> 
> I'll post a separate series, I don't think it needs to block this work, and I'm
> fairly certain I can get this done for 6.12 (shouldn't be a large or scary series,
> though I may tack on my lockless aging idea as an RFC).

Great, and thanks for all this information! Glad to know.

I guess it makes me feel more confident that this patch shouldn't have any
major side effect, at least on the KVM side.

Thanks,

> 
> https://lore.kernel.org/all/Zo137P7BFSxAutL2@google.com
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs
  2024-08-08 21:21     ` Peter Xu
@ 2024-08-08 21:31       ` Sean Christopherson
  2024-08-08 21:47         ` Peter Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Sean Christopherson @ 2024-08-08 21:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: James Houghton, kvm, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Dan Williams, Dave Jiang, x86, Hugh Dickins,
	Matthew Wilcox, Ingo Molnar, Huang Ying, Rik van Riel,
	David Rientjes, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Thomas Gleixner, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, linux-kernel, Aneesh Kumar K . V,
	Paolo Bonzini, Andrew Morton, Rick P Edgecombe, Mel Gorman

On Thu, Aug 08, 2024, Peter Xu wrote:
> Hi, Sean,
> 
> On Thu, Aug 08, 2024 at 08:33:59AM -0700, Sean Christopherson wrote:
> > On Wed, Aug 07, 2024, Peter Xu wrote:
> > > mprotect() does mmu notifiers in PMD levels.  It's there since 2014 of
> > > commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to
> > > change_pmd_range").
> > > 
> > > At that time, the issue was that NUMA balancing can be applied on a huge
> > > range of VM memory, even if nothing was populated.  The notification can be
> > > avoided in this case if no valid pmd detected, which includes either THP or
> > > a PTE pgtable page.
> > > 
> > > Now to pave way for PUD handling, this isn't enough.  We need to generate
> > > mmu notifications even on PUD entries properly.  mprotect() is currently
> > > broken on PUD (e.g., one can easily trigger kernel error with dax 1G
> > > mappings already), this is the start to fix it.
> > > 
> > > To fix that, this patch proposes to push such notifications to the PUD
> > > layers.
> > > 
> > > There is risk on regressing the problem Rik wanted to resolve before, but I
> > > think it shouldn't really happen, and I still chose this solution because
> > > of a few reasons:
> > > 
> > >   1) Consider a large VM that should definitely contain more than GBs of
> > >   memory, it's highly likely that PUDs are also none.  In this case there
> > 
> > I don't follow this.  Did you mean to say it's highly likely that PUDs are *NOT*
> > none?
> 
> I did mean the original wordings.
> 
> Note that in the previous case Rik worked on, it's about a mostly empty VM
> got NUMA hint applied.  So I did mean "PUDs are also none" here, with the
> hope that when the numa hint applies on any part of the unpopulated guest
> memory, it'll find nothing in PUDs. Here it's mostly not about a huge PUD
> mapping as long as the guest memory is not backed by DAX (since only DAX
> supports 1G huge pud so far, while hugetlb has its own path here in
> mprotect, so it must be things like anon or shmem), but a PUD entry that
> contains pmd pgtables.  For that part, I was trying to justify "no pmd
> pgtable installed" with the fact that "a large VM that should definitely
> contain more than GBs of memory", it means the PUD range should hopefully
> never been accessed, so even the pmd pgtable entry should be missing.

Ah, now I get what you were saying.

Problem is, walking the rmaps for the shadow MMU doesn't benefit (much) from
empty PUDs, because KVM needs to blindly walk the rmaps for every gfn covered by
the PUD to see if there are any SPTEs in any shadow MMUs mapping that gfn.  And
that walk is done without ever yielding, which I suspect is the source of the
soft lockups of yore.

And there's no way around that conundrum (walking rmaps), at least not without a
major rewrite in KVM.  In a nested TDP scenario, KVM's stage-2 page tables (for
L2) key off of L2 gfns, not L1 gfns, and so the only way to find mappings is
through the rmaps.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs
  2024-08-08 21:31       ` Sean Christopherson
@ 2024-08-08 21:47         ` Peter Xu
  2024-08-08 22:45           ` Sean Christopherson
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2024-08-08 21:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: James Houghton, kvm, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Dan Williams, Dave Jiang, x86, Hugh Dickins,
	Matthew Wilcox, Ingo Molnar, Huang Ying, Rik van Riel,
	David Rientjes, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Thomas Gleixner, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, linux-kernel, Aneesh Kumar K . V,
	Paolo Bonzini, Andrew Morton, Rick P Edgecombe, Mel Gorman

On Thu, Aug 08, 2024 at 02:31:19PM -0700, Sean Christopherson wrote:
> On Thu, Aug 08, 2024, Peter Xu wrote:
> > Hi, Sean,
> > 
> > On Thu, Aug 08, 2024 at 08:33:59AM -0700, Sean Christopherson wrote:
> > > On Wed, Aug 07, 2024, Peter Xu wrote:
> > > > mprotect() does mmu notifiers in PMD levels.  It's there since 2014 of
> > > > commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to
> > > > change_pmd_range").
> > > > 
> > > > At that time, the issue was that NUMA balancing can be applied on a huge
> > > > range of VM memory, even if nothing was populated.  The notification can be
> > > > avoided in this case if no valid pmd detected, which includes either THP or
> > > > a PTE pgtable page.
> > > > 
> > > > Now to pave way for PUD handling, this isn't enough.  We need to generate
> > > > mmu notifications even on PUD entries properly.  mprotect() is currently
> > > > broken on PUD (e.g., one can easily trigger kernel error with dax 1G
> > > > mappings already), this is the start to fix it.
> > > > 
> > > > To fix that, this patch proposes to push such notifications to the PUD
> > > > layers.
> > > > 
> > > > There is risk on regressing the problem Rik wanted to resolve before, but I
> > > > think it shouldn't really happen, and I still chose this solution because
> > > > of a few reasons:
> > > > 
> > > >   1) Consider a large VM that should definitely contain more than GBs of
> > > >   memory, it's highly likely that PUDs are also none.  In this case there
> > > 
> > > I don't follow this.  Did you mean to say it's highly likely that PUDs are *NOT*
> > > none?
> > 
> > I did mean the original wordings.
> > 
> > Note that in the previous case Rik worked on, it's about a mostly empty VM
> > got NUMA hint applied.  So I did mean "PUDs are also none" here, with the
> > hope that when the numa hint applies on any part of the unpopulated guest
> > memory, it'll find nothing in PUDs. Here it's mostly not about a huge PUD
> > mapping as long as the guest memory is not backed by DAX (since only DAX
> > supports 1G huge pud so far, while hugetlb has its own path here in
> > mprotect, so it must be things like anon or shmem), but a PUD entry that
> > contains pmd pgtables.  For that part, I was trying to justify "no pmd
> > pgtable installed" with the fact that "a large VM that should definitely
> > contain more than GBs of memory", it means the PUD range should hopefully
> > never been accessed, so even the pmd pgtable entry should be missing.
> 
> Ah, now I get what you were saying.
> 
> Problem is, walking the rmaps for the shadow MMU doesn't benefit (much) from
> empty PUDs, because KVM needs to blindly walk the rmaps for every gfn covered by
> the PUD to see if there are any SPTEs in any shadow MMUs mapping that gfn.  And
> that walk is done without ever yielding, which I suspect is the source of the
> soft lockups of yore.
> 
> And there's no way around that conundrum (walking rmaps), at least not without a
> major rewrite in KVM.  In a nested TDP scenario, KVM's stage-2 page tables (for
> L2) key off of L2 gfns, not L1 gfns, and so the only way to find mappings is
> through the rmaps.

I think the hope here is that when the PUDs covering the hinted range are all
empty, with no pgtable installed, there'll be no mmu notifier kicked off at
all.

To be explicit, I meant that after this patch is applied, the pud loop for
numa hints looks like this:

        FOR_EACH_PUD() {
                ...
                if (pud_none(pud))
                        continue;

                if (!range.start) {
                        mmu_notifier_range_init(&range,
                                                MMU_NOTIFY_PROTECTION_VMA, 0,
                                                vma->vm_mm, addr, end);
                        mmu_notifier_invalidate_range_start(&range);
                }
                ...
        }

So the hope is that pud_none() is always true for the hinted area (just like
before, when pmd_none() would hopefully always be true), and then we skip the
mmu notifier as a whole (including KVM's)!

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs
  2024-08-08 21:47         ` Peter Xu
@ 2024-08-08 22:45           ` Sean Christopherson
  0 siblings, 0 replies; 28+ messages in thread
From: Sean Christopherson @ 2024-08-08 22:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: James Houghton, kvm, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Dan Williams, Dave Jiang, x86, Hugh Dickins,
	Matthew Wilcox, Ingo Molnar, Huang Ying, Rik van Riel,
	David Rientjes, Nicholas Piggin, Borislav Petkov,
	Kirill A . Shutemov, Thomas Gleixner, Vlastimil Babka,
	Oscar Salvador, linuxppc-dev, linux-kernel, Aneesh Kumar K . V,
	Paolo Bonzini, Andrew Morton, Rick P Edgecombe, Mel Gorman

On Thu, Aug 08, 2024, Peter Xu wrote:
> On Thu, Aug 08, 2024 at 02:31:19PM -0700, Sean Christopherson wrote:
> > On Thu, Aug 08, 2024, Peter Xu wrote:
> > > Hi, Sean,
> > > 
> > > On Thu, Aug 08, 2024 at 08:33:59AM -0700, Sean Christopherson wrote:
> > > > On Wed, Aug 07, 2024, Peter Xu wrote:
> > > > > mprotect() does mmu notifiers in PMD levels.  It's there since 2014 of
> > > > > commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to
> > > > > change_pmd_range").
> > > > > 
> > > > > At that time, the issue was that NUMA balancing can be applied on a huge
> > > > > range of VM memory, even if nothing was populated.  The notification can be
> > > > > avoided in this case if no valid pmd detected, which includes either THP or
> > > > > a PTE pgtable page.
> > > > > 
> > > > > Now to pave way for PUD handling, this isn't enough.  We need to generate
> > > > > mmu notifications even on PUD entries properly.  mprotect() is currently
> > > > > broken on PUD (e.g., one can easily trigger kernel error with dax 1G
> > > > > mappings already), this is the start to fix it.
> > > > > 
> > > > > To fix that, this patch proposes to push such notifications to the PUD
> > > > > layers.
> > > > > 
> > > > > There is risk on regressing the problem Rik wanted to resolve before, but I
> > > > > think it shouldn't really happen, and I still chose this solution because
> > > > > of a few reasons:
> > > > > 
> > > > >   1) Consider a large VM that should definitely contain more than GBs of
> > > > >   memory, it's highly likely that PUDs are also none.  In this case there
> > > > 
> > > > I don't follow this.  Did you mean to say it's highly likely that PUDs are *NOT*
> > > > none?
> > > 
> > > I did mean the original wordings.
> > > 
> > > Note that in the previous case Rik worked on, it's about a mostly empty VM
> > > got NUMA hint applied.  So I did mean "PUDs are also none" here, with the
> > > hope that when the numa hint applies on any part of the unpopulated guest
> > > memory, it'll find nothing in PUDs. Here it's mostly not about a huge PUD
> > > mapping as long as the guest memory is not backed by DAX (since only DAX
> > > supports 1G huge pud so far, while hugetlb has its own path here in
> > > mprotect, so it must be things like anon or shmem), but a PUD entry that
> > > contains pmd pgtables.  For that part, I was trying to justify "no pmd
> > > pgtable installed" with the fact that "a large VM that should definitely
> > > contain more than GBs of memory", it means the PUD range should hopefully
> > > never been accessed, so even the pmd pgtable entry should be missing.
> > 
> > Ah, now I get what you were saying.
> > 
> > Problem is, walking the rmaps for the shadow MMU doesn't benefit (much) from
> > empty PUDs, because KVM needs to blindly walk the rmaps for every gfn covered by
> > the PUD to see if there are any SPTEs in any shadow MMUs mapping that gfn.  And
> > that walk is done without ever yielding, which I suspect is the source of the
> > soft lockups of yore.
> > 
> > And there's no way around that conundrum (walking rmaps), at least not without a
> > major rewrite in KVM.  In a nested TDP scenario, KVM's stage-2 page tables (for
> > L2) key off of L2 gfns, not L1 gfns, and so the only way to find mappings is
> > through the rmaps.
> 
> I think the hope here is that when the PUDs covering the hinted range are all
> empty, with no pgtable installed, there'll be no mmu notifier kicked off at
> all.
> 
> To be explicit, I meant that after this patch is applied, the pud loop for
> numa hints looks like this:
> 
>         FOR_EACH_PUD() {
>                 ...
>                 if (pud_none(pud))
>                         continue;
> 
>                 if (!range.start) {
>                         mmu_notifier_range_init(&range,
>                                                 MMU_NOTIFY_PROTECTION_VMA, 0,
>                                                 vma->vm_mm, addr, end);
>                         mmu_notifier_invalidate_range_start(&range);
>                 }
>                 ...
>         }
> 
> So the hope is that pud_none() is always true for the hinted area (just like
> before, when pmd_none() would hopefully always be true), and then we skip the
> mmu notifier as a whole (including KVM's)!

Gotcha, that makes sense.  Too many page tables flying around :-)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 4/7] mm/x86: Make pud_leaf() only care about PSE bit
  2024-08-08 14:54     ` Peter Xu
@ 2024-08-09 12:08       ` Thomas Gleixner
  2024-08-09 13:53         ` Peter Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2024-08-09 12:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: James Houghton, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Dave Jiang, x86, Hugh Dickins, Matthew Wilcox,
	Ingo Molnar, Huang Ying, Rik van Riel, Nicholas Piggin,
	Borislav Petkov, Kirill A . Shutemov, Dan Williams,
	Vlastimil Babka, Oscar Salvador, linuxppc-dev, linux-kernel,
	Aneesh Kumar K . V, Andrew Morton, Rick P Edgecombe, Mel Gorman

On Thu, Aug 08 2024 at 10:54, Peter Xu wrote:
> On Thu, Aug 08, 2024 at 12:22:38AM +0200, Thomas Gleixner wrote:
>> On Wed, Aug 07 2024 at 15:48, Peter Xu wrote:
>> > An entry should be reported as PUD leaf even if it's PROT_NONE, in which
>> > case PRESENT bit isn't there. I hit bad pud without this when testing dax
>> > 1G on zapping a PROT_NONE PUD.
>> 
>> That does not qualify as a change log. What you hit is irrelevant unless
>> you explain the actual underlying problem. See Documentation/process/
>> including the TIP documentation.
>
> Firstly, thanks a lot for the reviews.
>
> I thought the commit message explained exactly what is the underlying
> problem, no?
>
> The problem is even if PROT_NONE, as long as the PSE bit is set on the PUD
> it should be treated as a PUD leaf.

Sure. But why should it be treated so?

> Currently, the code will return pud_leaf()==false for those PROT_NONE
> PUD entries, and IMHO that is wrong.

Your humble opinion is fine, but hardly a technical argument.

> This patch wants to make it right.  I still think that's mostly what I put
> there in the commit message..
>
> Would you please suggest something so I can try to make it better,
> otherwise?  Or it'll be helpful too if you could point out which part of
> the two documentations I should reference.

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog

  A good structure is to explain the context, the problem and the
  solution in separate paragraphs and this order

>> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> > index e39311a89bf4..a2a3bd4c1bda 100644
>> > --- a/arch/x86/include/asm/pgtable.h
>> > +++ b/arch/x86/include/asm/pgtable.h
>> > @@ -1078,8 +1078,7 @@ static inline pmd_t *pud_pgtable(pud_t pud)
>> >  #define pud_leaf pud_leaf
>> >  static inline bool pud_leaf(pud_t pud)
>> >  {
>> > -	return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
>> > -		(_PAGE_PSE | _PAGE_PRESENT);
>> > +	return pud_val(pud) & _PAGE_PSE;
>> >  }
>> 
>> And the changelog does not explain why this change is not affecting any
>> existing user of pud_leaf().
>
> That's what I want to do: I want to affect them..

Fine. It's just that the changelog does not tell me what the actual problem
is ("I hit something" does not qualify) and "should be reported" is not
helpful either, as it does not explain anything.

> And IMHO it's mostly fine before because mprotect() is broken with 1g
> anyway, and I guess nobody managed to populate any pud entry with PROT_NONE
> on dax 1g before, and that's what this whole series is trying to fix.

Again your humble opinion matters, but technical facts and analysis
matter way more.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 4/7] mm/x86: Make pud_leaf() only care about PSE bit
  2024-08-09 12:08       ` Thomas Gleixner
@ 2024-08-09 13:53         ` Peter Xu
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Xu @ 2024-08-09 13:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: James Houghton, David Hildenbrand, Dave Hansen, linux-mm,
	Christophe Leroy, Dave Jiang, x86, Hugh Dickins, Matthew Wilcox,
	Ingo Molnar, Huang Ying, Rik van Riel, Nicholas Piggin,
	Borislav Petkov, Kirill A . Shutemov, Dan Williams,
	Vlastimil Babka, Oscar Salvador, linuxppc-dev, linux-kernel,
	Aneesh Kumar K . V, Andrew Morton, Rick P Edgecombe, Mel Gorman

On Fri, Aug 09, 2024 at 02:08:28PM +0200, Thomas Gleixner wrote:
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
> 
>   A good structure is to explain the context, the problem and the
>   solution in separate paragraphs and this order

I'll try to follow, thanks.

[...]

> > And IMHO it's mostly fine before because mprotect() is broken with 1g
> > anyway, and I guess nobody managed to populate any pud entry with PROT_NONE
> > on dax 1g before, and that's what this whole series is trying to fix.
> 
> Again your humble opinion matters, but technical facts and analysis
> matter way more.

All the rest of the comments in the reply were about "why it's a PUD leaf",
so let me reply in one shot.

Referring to pXd_leaf() documentation in linux/pgtable.h:

/*
 * pXd_leaf() is the API to check whether a pgtable entry is a huge page
 * mapping.  It should work globally across all archs, without any
 * dependency on CONFIG_* options.  For architectures that do not support
 * huge mappings on specific levels, below fallbacks will be used.
 *
 * A leaf pgtable entry should always imply the following:
 *
 * - It is a "present" entry.  IOW, before using this API, please check it
 *   with pXd_present() first. NOTE: it may not always mean the "present
 *   bit" is set.  For example, PROT_NONE entries are always "present".
 *
 * - It should _never_ be a swap entry of any type.  Above "present" check
 *   should have guarded this, but let's be crystal clear on this.
 *
 * - It should contain a huge PFN, which points to a huge page larger than
 *   PAGE_SIZE of the platform.  The PFN format isn't important here.
 *
 * - It should cover all kinds of huge mappings (e.g., pXd_trans_huge(),
 *   pXd_devmap(), or hugetlb mappings).
 */

It explicitly states that PROT_NONE should be treated as a present entry,
and also as a leaf.  The document is for pXd_leaf(), so it should cover puds
too.  In this specific case of the zapping path, the entry can only be a
DAX 1G thp.  But pud_leaf() should work for hugetlb too, for example, when
PROT_NONE is applied on top of a 1G hugetlb page with PSE set.

Unfortunately, I wrote this document in 64078b3d57.. so that's also another
way of saying "my humble opinion"..  It's just that nobody has disagreed so
far; please shoot if you see any issue with it.

IOW, I don't think we must define pXd_leaf() like this - we used to define
pXd_leaf() to cover migration entries at least on x86, for example. But per
my own past mm experience, the current way is the right thing to do to make
everything much easier and less error prone.  Sorry, I can't get rid of
"IMHO" here.

Another example of "we can define pXd_leaf() in other ways": I believe that
for the PPC 8XX series it's possible to make special use of pmd_leaf() by
allowing it to return true even for two contiguous pte pgtables covering 8MB
of memory.  But that would be an extremely special use of pmd_leaf() even if
it comes, maybe worth an update to the comment above when it happens, and
it'll only be used by powerpc, not any other arch.  It won't happen if we
want to drop 8MB support, though.

So in short, I don't think there's a 100% correct "technical" answer to "how
to define pXd_leaf()"; things just keep evolving, and "humble opinions" keep
coming with some good reasons.  I hope that answers the question to some
extent.

Taking all things into account, I wonder whether the enriched commit message
below would get me closer to your ACK on this, trying to follow the rule you
referenced on how context/problem/solution should be ordered and addressed:

    When working on mprotect() for 1G dax entries, I hit a "zap bad pud"
    error when zapping a huge pud that has PROT_NONE permission.

    Here the problem is that x86's pud_leaf() requires both the PRESENT and
    PSE bits to be set to report a pud entry as a leaf, but that doesn't
    look right, as it doesn't follow the pXd_leaf() definition that we have
    stuck with so far, where PROT_NONE entries should be reported as leaves.

    To fix it, change x86's pud_leaf() implementation to only check the PSE
    bit to report a leaf, regardless of whether the PRESENT bit is set.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-08-09 13:54 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-07 19:48 [PATCH v4 0/7] mm/mprotect: Fix dax puds Peter Xu
2024-08-07 19:48 ` [PATCH v4 1/7] mm/dax: Dump start address in fault handler Peter Xu
2024-08-07 19:48 ` [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs Peter Xu
2024-08-08 15:33   ` Sean Christopherson
2024-08-08 21:21     ` Peter Xu
2024-08-08 21:31       ` Sean Christopherson
2024-08-08 21:47         ` Peter Xu
2024-08-08 22:45           ` Sean Christopherson
2024-08-07 19:48 ` [PATCH v4 3/7] mm/powerpc: Add missing pud helpers Peter Xu
2024-08-07 19:48 ` [PATCH v4 4/7] mm/x86: Make pud_leaf() only care about PSE bit Peter Xu
2024-08-07 22:22   ` Thomas Gleixner
2024-08-08 14:54     ` Peter Xu
2024-08-09 12:08       ` Thomas Gleixner
2024-08-09 13:53         ` Peter Xu
2024-08-07 19:48 ` [PATCH v4 5/7] mm/x86: arch_check_zapped_pud() Peter Xu
2024-08-07 22:28   ` Thomas Gleixner
2024-08-08 15:49     ` Peter Xu
2024-08-08 20:45       ` David Hildenbrand
2024-08-07 19:48 ` [PATCH v4 6/7] mm/x86: Add missing pud helpers Peter Xu
2024-08-07 22:37   ` Thomas Gleixner
2024-08-08 20:25     ` Peter Xu
2024-08-07 19:48 ` [PATCH v4 7/7] mm/mprotect: fix dax pud handlings Peter Xu
2024-08-07 21:17 ` [PATCH v4 0/7] mm/mprotect: Fix dax puds Andrew Morton
2024-08-07 21:34   ` Peter Xu
2024-08-07 21:44     ` Andrew Morton
2024-08-08 14:34       ` Peter Xu
2024-08-07 21:23 ` Andrew Morton
2024-08-07 21:47   ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).