* [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
       [not found] <20220718120212.3180-1-namit@vmware.com>
@ 2022-07-18 12:01 ` Nadav Amit
  2022-07-19 20:47   ` Peter Xu
  2022-07-20  9:42   ` David Hildenbrand
  2022-07-18 12:02 ` [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages Nadav Amit
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 33+ messages in thread
From: Nadav Amit @ 2022-07-18 12:01 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

When userfaultfd makes a PTE writable, it can now change the PTE
directly, in some cases, without triggering a page-fault first.
Yet, doing so might leave the PTE that was write-unprotected as old and
clean. At least on x86, this would cause an overhead of >500 cycles when the
PTE is first accessed.

Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
gets a hint that the page is likely to be used. Avoid changing the PTE
to young and dirty in other cases to avoid excessive writeback and
messing with the page reclamation logic.

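For illustration, a minimal userspace sketch of a write-unprotect that
passes the hint. The UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY mode-flag
name is an assumption made for this sketch; the series itself only shows
the kernel-internal UFFD_FLAGS_WRITE_LIKELY flag:

#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Resolve uffd-wp on [addr, addr + len) and hint an imminent write. */
static int uffd_unprotect_will_write(int uffd, void *addr, size_t len)
{
	struct uffdio_writeprotect wp = {
		.range = {
			.start = (unsigned long)addr,
			.len = len,
		},
		/* no _WP bit: clear write-protection; the hint flag is hypothetical */
		.mode = UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY,
	};

	return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}

With such a hint, the resulting PTE is writable, young and dirty, so the
write that follows avoids both the page-fault and the first-access
overhead described above.
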
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
---
 include/linux/mm.h | 2 ++
 mm/mprotect.c      | 9 ++++++++-
 mm/userfaultfd.c   | 8 ++++++--
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9cc02a7e503b..4afd75ce5875 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 /* Whether this change is for write protecting */
 #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
 #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
+/* Whether to try to mark entries as dirty as they are to be written */
+#define  MM_CP_WILL_NEED		   (1UL << 4)
 #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
 					    MM_CP_UFFD_WP_RESOLVE)
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 996a97e213ad..34c2dfb68c42 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	bool will_need = cp_flags & MM_CP_WILL_NEED;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 
@@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 				ptent = pte_clear_uffd_wp(ptent);
 			}
 
+			if (will_need)
+				ptent = pte_mkyoung(ptent);
+
 			/*
 			 * In some writable, shared mappings, we might want
 			 * to catch actual write access -- see
@@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 			 */
 			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
 			    !pte_write(ptent) &&
-			    can_change_pte_writable(vma, addr, ptent))
+			    can_change_pte_writable(vma, addr, ptent)) {
 				ptent = pte_mkwrite(ptent);
+				if (will_need)
+					ptent = pte_mkdirty(ptent);
+			}
 
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			if (pte_needs_flush(oldpte, ptent))
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 954c6980b29f..e0492f5f06a0 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -749,6 +749,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	bool enable_wp = uffd_flags & UFFD_FLAGS_WP;
 	struct vm_area_struct *dst_vma;
 	unsigned long page_mask;
+	unsigned long cp_flags;
 	struct mmu_gather tlb;
 	pgprot_t newprot;
 	int err;
@@ -795,9 +796,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	else
 		newprot = vm_get_page_prot(dst_vma->vm_flags);
 
+	cp_flags = enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE;
+	if (uffd_flags & (UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY))
+		cp_flags |= MM_CP_WILL_NEED;
+
 	tlb_gather_mmu(&tlb, dst_mm);
-	change_protection(&tlb, dst_vma, start, start + len, newprot,
-			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
+	change_protection(&tlb, dst_vma, start, start + len, newprot, cp_flags);
 	tlb_finish_mmu(&tlb);
 
 	err = 0;
-- 
2.25.1




* [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages
       [not found] <20220718120212.3180-1-namit@vmware.com>
  2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-19 20:49   ` Peter Xu
  2022-07-18 12:02 ` [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable Nadav Amit
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

When using the userfaultfd write-(un)protect ioctl, try to change the PTE
to be writable. This would save a page-fault afterwards.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/userfaultfd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e0492f5f06a0..6013b217e9f3 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -799,6 +799,8 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	cp_flags = enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE;
 	if (uffd_flags & (UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY))
 		cp_flags |= MM_CP_WILL_NEED;
+	if (!enable_wp && (uffd_flags & UFFD_FLAGS_WRITE_LIKELY))
+		cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
 
 	tlb_gather_mmu(&tlb, dst_mm);
 	change_protection(&tlb, dst_vma, start, start + len, newprot, cp_flags);
-- 
2.25.1




* [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable
       [not found] <20220718120212.3180-1-namit@vmware.com>
  2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-20 15:19   ` David Hildenbrand
  2022-07-18 12:02 ` [RFC PATCH 04/14] mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE Nadav Amit
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Anonymous pages might have the dirty bit clear, but this should not
prevent mprotect() from making them writable if they are exclusive.
Therefore, skip the test of whether the PTE is dirty in this case.

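An abridged sketch of the resulting check; the VM_SHARED guard is an
assumption based on the surrounding context of can_change_pte_writable(),
which the hunk does not show:

	if (!(vma->vm_flags & VM_SHARED)) {
		/*
		 * Private mapping: an exclusive anon page is mapped only
		 * here, so mapping it writable cannot corrupt anyone
		 * else's view, even while the PTE is still clean.
		 */
		page = vm_normal_page(vma, addr, pte);
		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
			return false;
	} else if (!pte_dirty(pte)) {
		/* Shared mapping: clean PTEs still need a write fault. */
		return false;
	}

	return true;
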
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/mprotect.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 34c2dfb68c42..da5b9bf8204f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
 
 	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
 
-	if (pte_protnone(pte) || !pte_dirty(pte))
+	if (pte_protnone(pte))
 		return false;
 
 	/* Do we need write faults for softdirty tracking? */
@@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
 		page = vm_normal_page(vma, addr, pte);
 		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
 			return false;
-	}
+	} else if (!pte_dirty(pte))
+		return false;
 
 	return true;
 }
-- 
2.25.1




* [RFC PATCH 04/14] mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE
       [not found] <20220718120212.3180-1-namit@vmware.com>
                   ` (2 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 06/14] mm/rmap: avoid flushing on page_vma_mkclean_one() when possible Nadav Amit
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

When MM_CP_TRY_CHANGE_WRITABLE is used, change_pte_range() tries to set
PTEs as writable.

Yet, writable PTEs might still become read-only due to various
limitations of the logic that determines whether a PTE can become
writable (see can_change_pte_writable()). In any case, it is much easier
to keep the writable bit set when MM_CP_TRY_CHANGE_WRITABLE is used than
to first clear it and then figure out whether it can be set again.

Preserve the write-bit when MM_CP_TRY_CHANGE_WRITABLE is used, similarly
to the way it is done with NUMA.

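A sketch of the intended flow, assuming the existing preserve_write
consumer earlier in change_pte_range() (not visible in this hunk, and
shown here as it looks in the NUMA path of this kernel version):

	oldpte = ptep_modify_prot_start(vma, addr, pte);
	ptent = pte_modify(oldpte, newprot);	/* clears the write bit */
	if (preserve_write)
		ptent = pte_mk_savedwrite(ptent);	/* puts it back */
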
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/mprotect.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index da5b9bf8204f..92bfb17dcb8a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -84,6 +84,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 	bool will_need = cp_flags & MM_CP_WILL_NEED;
+	bool try_change_writable = cp_flags & MM_CP_TRY_CHANGE_WRITABLE;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 
@@ -114,7 +115,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || try_change_writable) &&
+					       pte_write(oldpte);
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
@@ -190,8 +192,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 			 * example, if a PTE is already dirty and no other
 			 * COW or special handling is required.
 			 */
-			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
-			    !pte_write(ptent) &&
+			if (try_change_writable && !pte_write(ptent) &&
 			    can_change_pte_writable(vma, addr, ptent)) {
 				ptent = pte_mkwrite(ptent);
 				if (will_need)
-- 
2.25.1




* [RFC PATCH 06/14] mm/rmap: avoid flushing on page_vma_mkclean_one() when possible
       [not found] <20220718120212.3180-1-namit@vmware.com>
                   ` (3 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 04/14] mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 07/14] mm: do fix spurious page-faults for instruction faults Nadav Amit
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

x86 is capable of avoiding a TLB flush when a clean writable entry is
write-protected: before the first write, the CPU must walk the page
tables to set the dirty bit, and that walk observes the updated PTE.
page_vma_mkclean_one() does not take advantage of this behavior. Adapt
it.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/rmap.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 83172ee0ea35..23997c387858 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -961,17 +961,25 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
 
 		address = pvmw->address;
 		if (pvmw->pte) {
-			pte_t entry;
+			pte_t entry, oldpte;
 			pte_t *pte = pvmw->pte;
 
 			if (!pte_dirty(*pte) && !pte_write(*pte))
 				continue;
 
 			flush_cache_page(vma, address, pte_pfn(*pte));
-			entry = ptep_clear_flush(vma, address, pte);
-			entry = pte_wrprotect(entry);
+			oldpte = ptep_modify_prot_start(pvmw->vma, address,
+							pte);
+
+			entry = pte_wrprotect(oldpte);
 			entry = pte_mkclean(entry);
-			set_pte_at(vma->vm_mm, address, pte, entry);
+
+			if (pte_needs_flush(oldpte, entry) ||
+			    mm_tlb_flush_pending(vma->vm_mm))
+				flush_tlb_page(vma, address);
+
+			ptep_modify_prot_commit(vma, address, pte, oldpte,
+						entry);
 			ret = 1;
 		} else {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-- 
2.25.1




* [RFC PATCH 07/14] mm: do fix spurious page-faults for instruction faults
       [not found] <20220718120212.3180-1-namit@vmware.com>
                   ` (4 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 06/14] mm/rmap: avoid flushing on page_vma_mkclean_one() when possible Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 08/14] x86/mm: introduce flush_tlb_fix_spurious_fault Nadav Amit
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

The next patches might cause spurious instruction faults on x86. To
prevent them from occurring too frequently, call
flush_tlb_fix_spurious_fault() for page-faults on instruction fetches as
well. The callee is expected to do a full TLB flush, or whatever is
necessary to avoid further spurious page-faults.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 31ec3f0071a2..152a47876c36 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4924,7 +4924,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 * This still avoids useless tlb flushes for .text page faults
 		 * with threads.
 		 */
-		if (vmf->flags & FAULT_FLAG_WRITE)
+		if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_INSTRUCTION))
 			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
 	}
 unlock:
-- 
2.25.1




* [RFC PATCH 08/14] x86/mm: introduce flush_tlb_fix_spurious_fault
       [not found] <20220718120212.3180-1-namit@vmware.com>
                   ` (5 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 07/14] mm: do fix spurious page-faults for instruction faults Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 10/14] x86/mm: introduce relaxed TLB flushes Nadav Amit
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

The next patches introduce relaxed TLB flushes for x86, which would
require a full TLB flush upon a spurious page-fault. If a spurious
page-fault occurs on x86, check whether the local TLB generation is out
of sync with the mm's TLB generation and perform a full flush if needed.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/pgtable.h |  4 +++-
 arch/x86/mm/tlb.c              | 17 +++++++++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 44e2d6f1dbaa..1fbdaff1bb7a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1079,7 +1079,9 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm,
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
 }
 
-#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
+extern void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
+					 unsigned long address);
+#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
 
 #define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d400b6d9d246..ff3bcc55435e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -955,6 +955,23 @@ static void put_flush_tlb_info(void)
 #endif
 }
 
+void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
+				  unsigned long address)
+{
+	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	u64 mm_tlb_gen = atomic64_read(&vma->vm_mm->context.tlb_gen);
+	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+	struct flush_tlb_info *info;
+
+	if (local_tlb_gen == mm_tlb_gen)
+		return;
+
+	preempt_disable();
+	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false, 0);
+	flush_tlb_func(info);
+	preempt_enable();
+}
+
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables)
-- 
2.25.1




* [RFC PATCH 10/14] x86/mm: introduce relaxed TLB flushes
       [not found] <20220718120212.3180-1-namit@vmware.com>
                   ` (6 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 08/14] x86/mm: introduce flush_tlb_fix_spurious_fault Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 11/14] x86/mm: use relaxed TLB flushes when protection is removed Nadav Amit
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 33+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Introduce relaxed TLB flushes on x86. When protection is removed from
PTEs (i.e., PTEs become writable or executable), relaxed TLB flushes are
used. A relaxed TLB flush does flush the local TLB, but does not flush
remote TLBs.

If later a spurious page-fault is encountered, and the local TLB
generation is found to be out of sync with the mm's TLB generation, a
full TLB flush takes place to prevent further spurious page-faults from
occurring.

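Usage sketch with the new flush_tlb_mm_range() signature; a hypothetical
caller that only relaxed permissions may pass strict=false, so only the
local TLB is flushed and remote CPUs catch up lazily through
flush_tlb_fix_spurious_fault():

	flush_tlb_mm_range(mm, start, end, PAGE_SHIFT,
			   false,	/* freed_tables */
			   false);	/* strict: relaxed flush suffices */
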
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlb.h      | 3 ++-
 arch/x86/include/asm/tlbflush.h | 9 +++++----
 arch/x86/kernel/alternative.c   | 2 +-
 arch/x86/kernel/ldt.c           | 3 ++-
 arch/x86/mm/tlb.c               | 4 ++--
 5 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 1bfe979bb9bc..51c85136f9a8 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -20,7 +20,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 		end = tlb->end;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables,
+			   tlb->strict);
 }
 
 /*
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 77d4810e5a5d..230cd1d24fe6 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -220,23 +220,24 @@ void flush_tlb_multi(const struct cpumask *cpumask,
 #endif
 
 #define flush_tlb_mm(mm)						\
-		flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true)
+		flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true, true)
 
 #define flush_tlb_range(vma, start, end)				\
 	flush_tlb_mm_range((vma)->vm_mm, start, end,			\
 			   ((vma)->vm_flags & VM_HUGETLB)		\
 				? huge_page_shift(hstate_vma(vma))	\
-				: PAGE_SHIFT, false)
+				: PAGE_SHIFT, false, true)
 
 extern void flush_tlb_all(void);
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
-				bool freed_tables);
+				bool freed_tables, bool strict);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
-	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
+	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false,
+			   true);
 }
 
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index e257f6c80372..48945a47fd76 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1099,7 +1099,7 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
 	 */
 	flush_tlb_mm_range(poking_mm, poking_addr, poking_addr +
 			   (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
-			   PAGE_SHIFT, false);
+			   PAGE_SHIFT, false, true);
 
 	if (func == text_poke_memcpy) {
 		/*
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 525876e7b9f4..7c7bc97324bc 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -372,7 +372,8 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
 	}
 
 	va = (unsigned long)ldt_slot_va(ldt->slot);
-	flush_tlb_mm_range(mm, va, va + nr_pages * PAGE_SIZE, PAGE_SHIFT, false);
+	flush_tlb_mm_range(mm, va, va + nr_pages * PAGE_SIZE, PAGE_SHIFT, false,
+			   true);
 }
 
 #else /* !CONFIG_PAGE_TABLE_ISOLATION */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ff3bcc55435e..ec5033d28a97 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -974,7 +974,7 @@ void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
-				bool freed_tables)
+				bool freed_tables, bool strict)
 {
 	struct flush_tlb_info *info;
 	u64 new_tlb_gen;
@@ -1000,7 +1000,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+	if (strict && cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
 		flush_tlb_multi(mm_cpumask(mm), info);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
-- 
2.25.1




* [RFC PATCH 11/14] x86/mm: use relaxed TLB flushes when protection is removed
       [not found] <20220718120212.3180-1-namit@vmware.com>
                   ` (7 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 10/14] x86/mm: introduce relaxed TLB flushes Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 12/14] x86/tlb: no flush on PTE change from RW->RO when PTE is clean Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 14/14] mm: conditional check of pfn in pte_flush_type Nadav Amit
  10 siblings, 0 replies; 33+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

When checking x86 PTE flags to determine whether a TLB flush is needed,
also determine whether a relaxed TLB flush is sufficient. If protection
is removed (NX cleared or W set), indicate that a relaxed TLB flush
suffices: the worst a stale, more restrictive TLB entry can cause is a
spurious page-fault, which flush_tlb_fix_spurious_fault() resolves.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 230cd1d24fe6..4f98735ab07a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -271,18 +271,23 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 	 * dirty/access bit if needed without a fault.
 	 */
 	const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT |
-					_PAGE_ACCESSED;
+					_PAGE_ACCESSED | _PAGE_RW;
+	const pteval_t flush_on_set = _PAGE_NX;
+	const pteval_t flush_on_set_relaxed = _PAGE_RW;
+	const pteval_t flush_on_clear_relaxed = _PAGE_NX;
 	const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
 					_PAGE_SOFTW3 | _PAGE_SOFTW4;
-	const pteval_t flush_on_change = _PAGE_RW | _PAGE_USER | _PAGE_PWT |
+	const pteval_t flush_on_change = _PAGE_USER | _PAGE_PWT |
 			  _PAGE_PCD | _PAGE_PSE | _PAGE_GLOBAL | _PAGE_PAT |
 			  _PAGE_PAT_LARGE | _PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT1 |
-			  _PAGE_PKEY_BIT2 | _PAGE_PKEY_BIT3 | _PAGE_NX;
+			  _PAGE_PKEY_BIT2 | _PAGE_PKEY_BIT3;
 	unsigned long diff = oldflags ^ newflags;
 
 	BUILD_BUG_ON(flush_on_clear & software_flags);
 	BUILD_BUG_ON(flush_on_clear & flush_on_change);
 	BUILD_BUG_ON(flush_on_change & software_flags);
+	BUILD_BUG_ON(flush_on_change & flush_on_clear_relaxed);
+	BUILD_BUG_ON(flush_on_change & flush_on_set_relaxed);
 
 	/* Ignore software flags */
 	diff &= ~software_flags;
@@ -301,9 +306,16 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 	if (diff & flush_on_change)
 		return PTE_FLUSH_STRICT;
 
+	if (diff & oldflags & flush_on_clear_relaxed)
+		return PTE_FLUSH_RELAXED;
+
+	if (diff & newflags & flush_on_set_relaxed)
+		return PTE_FLUSH_RELAXED;
+
 	/* Ensure there are no flags that were left behind */
 	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
-	    (diff & ~(flush_on_clear | software_flags | flush_on_change))) {
+	    (diff & ~(flush_on_clear | flush_on_set |
+		      software_flags | flush_on_change))) {
 		VM_WARN_ON_ONCE(1);
 		return PTE_FLUSH_STRICT;
 	}
-- 
2.25.1




* [RFC PATCH 12/14] x86/tlb: no flush on PTE change from RW->RO when PTE is clean
       [not found] <20220718120212.3180-1-namit@vmware.com>
                   ` (8 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 11/14] x86/mm: use relaxed TLB flushes when protection is removed Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 14/14] mm: conditional check of pfn in pte_flush_type Nadav Amit
  10 siblings, 0 replies; 33+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

On x86 it is possible to skip the TLB flush when an RW entry becomes RO
and the PTE is clean: a TLB entry cached as clean forces the CPU to walk
the page tables on the first write in order to set the dirty bit, and
that walk observes the now read-only PTE and faults. Add logic to detect
this case and skip the flush.

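A standalone, compilable model of the resulting classification, using
simplified flag values and only the bits relevant here (the real logic,
including the ACCESSED, PKEY and cachability bits, lives in
pte_flags_flush_type()):

#include <stdint.h>
#include <stdio.h>

#define F_PRESENT	(1u << 0)
#define F_RW		(1u << 1)
#define F_DIRTY		(1u << 2)
#define F_NX		(1u << 3)	/* simplified bit positions */

enum flush { FLUSH_NONE, FLUSH_RELAXED, FLUSH_STRICT };

static enum flush flush_type(uint32_t oldf, uint32_t newf)
{
	uint32_t diff = oldf ^ newf;

	if (!(oldf & F_PRESENT))
		return FLUSH_NONE;
	/* Clearing PRESENT or DIRTY must flush everywhere. */
	if (diff & oldf & (F_PRESENT | F_DIRTY))
		return FLUSH_STRICT;
	/* Setting NX must flush: a stale executable entry is unsafe. */
	if (diff & newf & F_NX)
		return FLUSH_STRICT;
	/* RW->RO needs a flush only if the old entry was dirty. */
	if ((diff & oldf & F_RW) && (oldf & F_DIRTY))
		return FLUSH_STRICT;
	/* Setting W or clearing NX risks only spurious faults. */
	if ((diff & newf & F_RW) || (diff & oldf & F_NX))
		return FLUSH_RELAXED;
	return FLUSH_NONE;
}

int main(void)
{
	/* clean RW->RO: prints 0 (no flush) */
	printf("%d\n", flush_type(F_PRESENT | F_RW, F_PRESENT));
	/* dirty RW->RO (dirty bit kept): prints 2 (strict) */
	printf("%d\n", flush_type(F_PRESENT | F_RW | F_DIRTY,
				  F_PRESENT | F_DIRTY));
	/* RO->RW: prints 1 (relaxed) */
	printf("%d\n", flush_type(F_PRESENT, F_PRESENT | F_RW));
	return 0;
}
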
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4f98735ab07a..58c95e36b098 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -271,8 +271,9 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 	 * dirty/access bit if needed without a fault.
 	 */
 	const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT |
-					_PAGE_ACCESSED | _PAGE_RW;
+					_PAGE_ACCESSED;
 	const pteval_t flush_on_set = _PAGE_NX;
+	const pteval_t flush_on_special = _PAGE_RW;
 	const pteval_t flush_on_set_relaxed = _PAGE_RW;
 	const pteval_t flush_on_clear_relaxed = _PAGE_NX;
 	const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
@@ -302,6 +303,17 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 	if (diff & oldflags & flush_on_clear)
 		return PTE_FLUSH_STRICT;
 
+	/*
+	 * Were any of the 'flush_on_set' flags set between 'oldflags' and
+	 * 'newflags'?
+	 */
+	if (diff & newflags & flush_on_set)
+		return PTE_FLUSH_STRICT;
+
+	/* On RW->RO, a flush is needed if the old entry is dirty */
+	if ((diff & oldflags & _PAGE_RW) && (oldflags & _PAGE_DIRTY))
+		return PTE_FLUSH_STRICT;
+
 	/* Flush on modified flags. */
 	if (diff & flush_on_change)
 		return PTE_FLUSH_STRICT;
@@ -314,7 +326,7 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 
 	/* Ensure there are no flags that were left behind */
 	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
-	    (diff & ~(flush_on_clear | flush_on_set |
+	    (diff & ~(flush_on_clear | flush_on_set | flush_on_special |
 		      software_flags | flush_on_change))) {
 		VM_WARN_ON_ONCE(1);
 		return PTE_FLUSH_STRICT;
-- 
2.25.1




* [RFC PATCH 14/14] mm: conditional check of pfn in pte_flush_type
       [not found] <20220718120212.3180-1-namit@vmware.com>
                   ` (9 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 12/14] x86/tlb: no flush on PTE change from RW->RO when PTE is clean Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  10 siblings, 0 replies; 33+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Checking whether the PFNs in two PTEs are the same takes a surprisingly
large number of instructions. Yet in most cases the caller of
pte_flush_type() already knows whether the PFN changed. For instance,
mprotect() does not change the PFN, but only modifies the protection
flags.

Add an argument to pte_flush_type() to indicate whether the PFN should
be checked. Keep checking it when VM debugging is enabled
(CONFIG_DEBUG_VM) to catch callers that wrongly assume the PFN is
unchanged.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 14 ++++++++++----
 include/asm-generic/tlb.h       |  6 ++++--
 mm/huge_memory.c                |  2 +-
 mm/mprotect.c                   |  2 +-
 mm/rmap.c                       |  2 +-
 5 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 58c95e36b098..50349861fdc9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -340,14 +340,17 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
  * whether a strict or relaxed TLB flush is need. It should only be used on
  * userspace PTEs.
  */
-static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
+static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte,
+						   bool check_pfn)
 {
 	/* !PRESENT -> * ; no need for flush */
 	if (!(pte_flags(oldpte) & _PAGE_PRESENT))
 		return PTE_FLUSH_NONE;
 
 	/* PFN changed ; needs flush */
-	if (pte_pfn(oldpte) != pte_pfn(newpte))
+	if (!check_pfn)
+		VM_BUG_ON(pte_pfn(oldpte) != pte_pfn(newpte));
+	else if (pte_pfn(oldpte) != pte_pfn(newpte))
 		return PTE_FLUSH_STRICT;
 
 	/*
@@ -363,14 +366,17 @@ static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
  * huge_pmd_flush_type() checks whether permissions were demoted and require a
  * flush. It should only be used for userspace huge PMDs.
  */
-static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd)
+static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd,
+						      bool check_pfn)
 {
 	/* !PRESENT -> * ; no need for flush */
 	if (!(pmd_flags(oldpmd) & _PAGE_PRESENT))
 		return PTE_FLUSH_NONE;
 
 	/* PFN changed ; needs flush */
-	if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
+	if (!check_pfn)
+		VM_BUG_ON(pmd_pfn(oldpmd) != pmd_pfn(newpmd));
+	else if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
 		return PTE_FLUSH_STRICT;
 
 	/*
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 07b3eb8caf63..aee9da6cc5d5 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -677,14 +677,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 #endif
 
 #ifndef pte_flush_type
-static inline struct pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
+static inline struct pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte,
+						   bool check_pfn)
 {
 	return PTE_FLUSH_STRICT;
 }
 #endif
 
 #ifndef huge_pmd_flush_type
-static inline bool huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd)
+static inline bool huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd,
+				       bool check_pfn)
 {
 	return PTE_FLUSH_STRICT;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b32b7da0f6f7..92a7b3ca317f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1818,7 +1818,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 	flush_type = PTE_FLUSH_STRICT;
 	if (!tlb->strict)
-		flush_type = huge_pmd_flush_type(oldpmd, entry);
+		flush_type = huge_pmd_flush_type(oldpmd, entry, false);
 	if (flush_type != PTE_FLUSH_NONE)
 		tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE,
 				    flush_type == PTE_FLUSH_STRICT);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index cf775f6c8c08..78081d7f4edf 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -204,7 +204,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 
 			flush_type = PTE_FLUSH_STRICT;
 			if (!tlb->strict)
-				flush_type = pte_flush_type(oldpte, ptent);
+				flush_type = pte_flush_type(oldpte, ptent, false);
 			if (flush_type != PTE_FLUSH_NONE)
 				tlb_flush_pte_range(tlb, addr, PAGE_SIZE,
 						flush_type == PTE_FLUSH_STRICT);
diff --git a/mm/rmap.c b/mm/rmap.c
index 62f4b2a4f067..63261619b607 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -974,7 +974,7 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
 			entry = pte_wrprotect(oldpte);
 			entry = pte_mkclean(entry);
 
-			if (pte_flush_type(oldpte, entry) != PTE_FLUSH_NONE ||
+			if (pte_flush_type(oldpte, entry, false) != PTE_FLUSH_NONE ||
 			    mm_tlb_flush_pending(vma->vm_mm))
 				flush_tlb_page(vma, address);
 
-- 
2.25.1




* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
@ 2022-07-19 20:47   ` Peter Xu
  2022-07-20  9:39     ` David Hildenbrand
  2022-07-20  9:42   ` David Hildenbrand
  1 sibling, 1 reply; 33+ messages in thread
From: Peter Xu @ 2022-07-19 20:47 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, David Hildenbrand, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

On Mon, Jul 18, 2022 at 05:01:59AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> When userfaultfd makes a PTE writable, it can now change the PTE
> directly, in some cases, without triggering a page-fault first.
> Yet, doing so might leave the PTE that was write-unprotected as old and
> clean. At least on x86, this would cause an overhead of >500 cycles when the
> PTE is first accessed.
> 
> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
> gets a hint that the page is likely to be used. Avoid changing the PTE
> to young and dirty in other cases to avoid excessive writeback and
> messing with the page reclamation logic.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> ---
>  include/linux/mm.h | 2 ++
>  mm/mprotect.c      | 9 ++++++++-
>  mm/userfaultfd.c   | 8 ++++++--
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9cc02a7e503b..4afd75ce5875 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>  /* Whether this change is for write protecting */
>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
> +/* Whether to try to mark entries as dirty as they are to be written */
> +#define  MM_CP_WILL_NEED		   (1UL << 4)
>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>  					    MM_CP_UFFD_WP_RESOLVE)
>  
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 996a97e213ad..34c2dfb68c42 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
>  
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  
> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  				ptent = pte_clear_uffd_wp(ptent);
>  			}
>  
> +			if (will_need)
> +				ptent = pte_mkyoung(ptent);

For uffd path, UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY are new
internal flags used with or without the new feature bit set.  It means even
with !ACCESS_HINT we'll start to set young bit while we used not to?  Is
that some kind of a light abi change?

I'd suggest we only set will_need if ACCESS_HINT is set.

> +
>  			/*
>  			 * In some writable, shared mappings, we might want
>  			 * to catch actual write access -- see
> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  			 */
>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>  			    !pte_write(ptent) &&
> -			    can_change_pte_writable(vma, addr, ptent))
> +			    can_change_pte_writable(vma, addr, ptent)) {
>  				ptent = pte_mkwrite(ptent);
> +				if (will_need)
> +					ptent = pte_mkdirty(ptent);

Can we make this unconditional?  IOW to cover both:

  (1) When will_need is not set, or
  (2) mprotect() too

David's patch is good in that we merged the unprotect and CoW.  However
that's not complete because the dirty bit ops are missing.

Here IMHO we should have a standalone patch to just add the dirty bit into
this logic when we'll grant write bit.  IMHO it'll make the write+dirty
bits coherent again in all paths.

> +			}
>  
>  			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>  			if (pte_needs_flush(oldpte, ptent))
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 954c6980b29f..e0492f5f06a0 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -749,6 +749,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
>  	bool enable_wp = uffd_flags & UFFD_FLAGS_WP;
>  	struct vm_area_struct *dst_vma;
>  	unsigned long page_mask;
> +	unsigned long cp_flags;
>  	struct mmu_gather tlb;
>  	pgprot_t newprot;
>  	int err;
> @@ -795,9 +796,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
>  	else
>  		newprot = vm_get_page_prot(dst_vma->vm_flags);
>  
> +	cp_flags = enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE;
> +	if (uffd_flags & (UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY))
> +		cp_flags |= MM_CP_WILL_NEED;
> +
>  	tlb_gather_mmu(&tlb, dst_mm);
> -	change_protection(&tlb, dst_vma, start, start + len, newprot,
> -			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
> +	change_protection(&tlb, dst_vma, start, start + len, newprot, cp_flags);
>  	tlb_finish_mmu(&tlb);
>  
>  	err = 0;
> -- 
> 2.25.1
> 

-- 
Peter Xu




* Re: [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages
  2022-07-18 12:02 ` [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages Nadav Amit
@ 2022-07-19 20:49   ` Peter Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2022-07-19 20:49 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, David Hildenbrand, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

On Mon, Jul 18, 2022 at 05:02:00AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> When using the userfaultfd write-(un)protect ioctl, try to change the PTE
> to be writable. This would save a page-fault afterwards.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-19 20:47   ` Peter Xu
@ 2022-07-20  9:39     ` David Hildenbrand
  2022-07-20 13:10       ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2022-07-20  9:39 UTC (permalink / raw)
  To: Peter Xu, Nadav Amit
  Cc: linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 19.07.22 22:47, Peter Xu wrote:
> On Mon, Jul 18, 2022 at 05:01:59AM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>>
>> When userfaultfd makes a PTE writable, it can now change the PTE
>> directly, in some cases, without triggering a page-fault first.
>> Yet, doing so might leave the PTE that was write-unprotected as old and
>> clean. At least on x86, this would cause an overhead of >500 cycles when the
>> PTE is first accessed.
>>
>> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
>> gets a hint that the page is likely to be used. Avoid changing the PTE
>> to young and dirty in other cases to avoid excessive writeback and
>> messing with the page reclamation logic.
>>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> ---
>>  include/linux/mm.h | 2 ++
>>  mm/mprotect.c      | 9 ++++++++-
>>  mm/userfaultfd.c   | 8 ++++++--
>>  3 files changed, 16 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 9cc02a7e503b..4afd75ce5875 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>  /* Whether this change is for write protecting */
>>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
>>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
>> +/* Whether to try to mark entries as dirty as they are to be written */
>> +#define  MM_CP_WILL_NEED		   (1UL << 4)
>>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>  					    MM_CP_UFFD_WP_RESOLVE)
>>  
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 996a97e213ad..34c2dfb68c42 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
>>  
>>  	tlb_change_page_size(tlb, PAGE_SIZE);
>>  
>> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>  				ptent = pte_clear_uffd_wp(ptent);
>>  			}
>>  
>> +			if (will_need)
>> +				ptent = pte_mkyoung(ptent);
> 
> For uffd path, UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY are new
> internal flags used with or without the new feature bit set.  It means even
> with !ACCESS_HINT we'll start to set young bit while we used not to?  Is
> that some kind of a light abi change?
> 
> I'd suggest we only set will_need if ACCESS_HINT is set.
> 
>> +
>>  			/*
>>  			 * In some writable, shared mappings, we might want
>>  			 * to catch actual write access -- see
>> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>  			 */
>>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>  			    !pte_write(ptent) &&
>> -			    can_change_pte_writable(vma, addr, ptent))
>> +			    can_change_pte_writable(vma, addr, ptent)) {
>>  				ptent = pte_mkwrite(ptent);
>> +				if (will_need)
>> +					ptent = pte_mkdirty(ptent);
> 
> Can we make this unconditional?  IOW to cover both:
> 
>   (1) When will_need is not set, or
>   (2) mprotect() too
> 
> David's patch is good in that we merged the unprotect and CoW.  However
> that's not complete because the dirty bit ops are missing.
> 
> Here IMHO we should have a standalone patch to just add the dirty bit into
> this logic when we'll grant write bit.  IMHO it'll make the write+dirty
> bits coherent again in all paths.

I'm not sure I follow.

We *surely* don't want to dirty random pages (especially once in the
pagecache/swapcache) simply because we change protection.

Just like we don't set all pages write+dirty in a writable VMA on a read
fault.

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
  2022-07-19 20:47   ` Peter Xu
@ 2022-07-20  9:42   ` David Hildenbrand
  2022-07-20 17:36     ` Nadav Amit
  1 sibling, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2022-07-20  9:42 UTC (permalink / raw)
  To: Nadav Amit, linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Xu, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 18.07.22 14:01, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> When userfaultfd makes a PTE writable, it can now change the PTE
> directly, in some cases, without triggering a page-fault first.
> Yet, doing so might leave the PTE that was write-unprotected as old and
> clean. At least on x86, this would cause an overhead of >500 cycles when the
> PTE is first accessed.
> 
> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
> gets a hint that the page is likely to be used. Avoid changing the PTE
> to young and dirty in other cases to avoid excessive writeback and
> messing with the page reclamation logic.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> ---
>  include/linux/mm.h | 2 ++
>  mm/mprotect.c      | 9 ++++++++-
>  mm/userfaultfd.c   | 8 ++++++--
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9cc02a7e503b..4afd75ce5875 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>  /* Whether this change is for write protecting */
>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
> +/* Whether to try to mark entries as dirty as they are to be written */
> +#define  MM_CP_WILL_NEED		   (1UL << 4)
>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>  					    MM_CP_UFFD_WP_RESOLVE)
>  
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 996a97e213ad..34c2dfb68c42 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
>  
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  
> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  				ptent = pte_clear_uffd_wp(ptent);
>  			}
>  
> +			if (will_need)
> +				ptent = pte_mkyoung(ptent);
> +
>  			/*
>  			 * In some writable, shared mappings, we might want
>  			 * to catch actual write access -- see
> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  			 */
>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>  			    !pte_write(ptent) &&


Why would we want to check if we can set something writable if it
already *is* writable? That doesn't make sense to me.

> -			    can_change_pte_writable(vma, addr, ptent))
> +			    can_change_pte_writable(vma, addr, ptent)) {
>  				ptent = pte_mkwrite(ptent);
> +				if (will_need)
> +					ptent = pte_mkdirty(ptent);
> +			}

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20  9:39     ` David Hildenbrand
@ 2022-07-20 13:10       ` Peter Xu
  2022-07-20 15:10         ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2022-07-20 13:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On Wed, Jul 20, 2022 at 11:39:23AM +0200, David Hildenbrand wrote:
> On 19.07.22 22:47, Peter Xu wrote:
> > On Mon, Jul 18, 2022 at 05:01:59AM -0700, Nadav Amit wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >>
> >> When userfaultfd makes a PTE writable, it can now change the PTE
> >> directly, in some cases, without going triggering a page-fault first.
> >> Yet, doing so might leave the PTE that was write-unprotected as old and
> >> clean. At least on x86, this would cause a >500 cycles overhead when the
> >> PTE is first accessed.
> >>
> >> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
> >> gets a hint that the page is likely to be used. Avoid changing the PTE
> >> to young and dirty in other cases to avoid excessive writeback and
> >> messing with the page reclamation logic.
> >>
> >> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Andy Lutomirski <luto@kernel.org>
> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> Cc: David Hildenbrand <david@redhat.com>
> >> Cc: Peter Xu <peterx@redhat.com>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Thomas Gleixner <tglx@linutronix.de>
> >> Cc: Will Deacon <will@kernel.org>
> >> Cc: Yu Zhao <yuzhao@google.com>
> >> Cc: Nick Piggin <npiggin@gmail.com>
> >> ---
> >>  include/linux/mm.h | 2 ++
> >>  mm/mprotect.c      | 9 ++++++++-
> >>  mm/userfaultfd.c   | 8 ++++++--
> >>  3 files changed, 16 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index 9cc02a7e503b..4afd75ce5875 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
> >>  /* Whether this change is for write protecting */
> >>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
> >>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
> >> +/* Whether to try to mark entries as dirty as they are to be written */
> >> +#define  MM_CP_WILL_NEED		   (1UL << 4)
> >>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
> >>  					    MM_CP_UFFD_WP_RESOLVE)
> >>  
> >> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >> index 996a97e213ad..34c2dfb68c42 100644
> >> --- a/mm/mprotect.c
> >> +++ b/mm/mprotect.c
> >> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
> >>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
> >>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> >>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> >> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
> >>  
> >>  	tlb_change_page_size(tlb, PAGE_SIZE);
> >>  
> >> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
> >>  				ptent = pte_clear_uffd_wp(ptent);
> >>  			}
> >>  
> >> +			if (will_need)
> >> +				ptent = pte_mkyoung(ptent);
> > 
> > For uffd path, UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY are new
> > internal flags used with or without the new feature bit set.  It means even
> > with !ACCESS_HINT we'll start to set young bit while we used not to?  Is
> > that some kind of a light abi change?
> > 
> > I'd suggest we only set will_need if ACCESS_HINT is set.
> > 
> >> +
> >>  			/*
> >>  			 * In some writable, shared mappings, we might want
> >>  			 * to catch actual write access -- see
> >> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
> >>  			 */
> >>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> >>  			    !pte_write(ptent) &&
> >> -			    can_change_pte_writable(vma, addr, ptent))
> >> +			    can_change_pte_writable(vma, addr, ptent)) {
> >>  				ptent = pte_mkwrite(ptent);
> >> +				if (will_need)
> >> +					ptent = pte_mkdirty(ptent);
> > 
> > Can we make this unconditional?  IOW to cover both:
> > 
> >   (1) When will_need is not set, or
> >   (2) mprotect() too
> > 
> > David's patch is good in that we merged the unprotect and CoW.  However
> > that's not complete because the dirty bit ops are missing.
> > 
> > Here IMHO we should have a standalone patch to just add the dirty bit into
> > this logic when we'll grant write bit.  IMHO it'll make the write+dirty
> > bits coherent again in all paths.
> 
> I'm not sure I follow.
> 
> We *surely* don't want to dirty random pages (especially once in the
> pagecache/swapcache) simply because we change protection.
> 
> Just like we don't set all pages write+dirty in a writable VMA on a read
> fault.

IMO an unprotect (in generic mprotect form or uffd form) is a stronger sign
of a page being written than a read fault, as many of them happen because
the page is being written and a message was generated.

But yeah, you have a point too, maybe we shouldn't assume such a condition.
Especially as long as we won't set write=1 with soft-dirty tracking
enabled, I think it should be safe.
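
(For reference, the gate I'm referring to is roughly the existing soft-dirty
check in can_change_pte_writable() -- sketch only, helper names as in
current mm code:)

	/* Do we need write faults for softdirty tracking? */
	if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
		return false;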

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 13:10       ` Peter Xu
@ 2022-07-20 15:10         ` David Hildenbrand
  2022-07-20 19:15           ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2022-07-20 15:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 20.07.22 15:10, Peter Xu wrote:
> On Wed, Jul 20, 2022 at 11:39:23AM +0200, David Hildenbrand wrote:
>> On 19.07.22 22:47, Peter Xu wrote:
>>> On Mon, Jul 18, 2022 at 05:01:59AM -0700, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>>
>>>> When userfaultfd makes a PTE writable, it can now change the PTE
>>>> directly, in some cases, without going triggering a page-fault first.
>>>> Yet, doing so might leave the PTE that was write-unprotected as old and
>>>> clean. At least on x86, this would cause a >500 cycles overhead when the
>>>> PTE is first accessed.
>>>>
>>>> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
>>>> gets a hint that the page is likely to be used. Avoid changing the PTE
>>>> to young and dirty in other cases to avoid excessive writeback and
>>>> messing with the page reclamation logic.
>>>>
>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: Andy Lutomirski <luto@kernel.org>
>>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Peter Xu <peterx@redhat.com>
>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>> Cc: Will Deacon <will@kernel.org>
>>>> Cc: Yu Zhao <yuzhao@google.com>
>>>> Cc: Nick Piggin <npiggin@gmail.com>
>>>> ---
>>>>  include/linux/mm.h | 2 ++
>>>>  mm/mprotect.c      | 9 ++++++++-
>>>>  mm/userfaultfd.c   | 8 ++++++--
>>>>  3 files changed, 16 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 9cc02a7e503b..4afd75ce5875 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>>>  /* Whether this change is for write protecting */
>>>>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
>>>>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
>>>> +/* Whether to try to mark entries as dirty as they are to be written */
>>>> +#define  MM_CP_WILL_NEED		   (1UL << 4)
>>>>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>>>  					    MM_CP_UFFD_WP_RESOLVE)
>>>>  
>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>> index 996a97e213ad..34c2dfb68c42 100644
>>>> --- a/mm/mprotect.c
>>>> +++ b/mm/mprotect.c
>>>> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>>>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>>>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>>>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>>>> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
>>>>  
>>>>  	tlb_change_page_size(tlb, PAGE_SIZE);
>>>>  
>>>> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>>>  				ptent = pte_clear_uffd_wp(ptent);
>>>>  			}
>>>>  
>>>> +			if (will_need)
>>>> +				ptent = pte_mkyoung(ptent);
>>>
>>> For uffd path, UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY are new
>>> internal flags used with or without the new feature bit set.  It means even
>>> with !ACCESS_HINT we'll start to set young bit while we used not to?  Is
>>> that some kind of a light abi change?
>>>
>>> I'd suggest we only set will_need if ACCESS_HINT is set.
>>>
>>>> +
>>>>  			/*
>>>>  			 * In some writable, shared mappings, we might want
>>>>  			 * to catch actual write access -- see
>>>> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>>>  			 */
>>>>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>>>  			    !pte_write(ptent) &&
>>>> -			    can_change_pte_writable(vma, addr, ptent))
>>>> +			    can_change_pte_writable(vma, addr, ptent)) {
>>>>  				ptent = pte_mkwrite(ptent);
>>>> +				if (will_need)
>>>> +					ptent = pte_mkdirty(ptent);
>>>
>>> Can we make this unconditional?  IOW to cover both:
>>>
>>>   (1) When will_need is not set, or
>>>   (2) mprotect() too
>>>
>>> David's patch is good in that we merged the unprotect and CoW.  However
>>> that's not complete because the dirty bit ops are missing.
>>>
>>> Here IMHO we should have a standalone patch to just add the dirty bit into
>>> this logic when we'll grant write bit.  IMHO it'll make the write+dirty
>>> bits coherent again in all paths.
>>
>> I'm not sure I follow.
>>
>> We *surely* don't want to dirty random pages (especially once in the
>> pagecache/swapcache) simply because we change protection.
>>
>> Just like we don't set all pages write+dirty in a writable VMA on a read
>> fault.
> 
> IMO an unprotect (in generic mprotect form or uffd form) is a stronger sign
> of a page being written than a read fault, as many of them happen because
> the page is being written and a message was generated.

I'm sorry, but I am very skeptical about such statements. I don't buy it.

> 
> But yeah, you have a point too, maybe we shouldn't assume such a condition.
> Especially as long as we won't set write=1 with soft-dirty tracking
> enabled, I think it should be safe.

For pagecache pages it may as well be *plain wrong* to bypass the write
fault handler and simply mark pages dirty+map them writable.

Please, let's keep this protection change handler here as simple as
possible and *not* try to replicate the whole write fault handler logic
in here by relying on statements as above.

If we try optimizing for corner cases by making the implementation here
overly complicated, then we are clearly doing something wrong.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable
  2022-07-18 12:02 ` [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable Nadav Amit
@ 2022-07-20 15:19   ` David Hildenbrand
  2022-07-20 17:25     ` Nadav Amit
  0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2022-07-20 15:19 UTC (permalink / raw)
  To: Nadav Amit, linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Xu, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 18.07.22 14:02, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Anonymous pages might have the dirty bit clear, but this should not
> prevent mprotect from making them writable if they are exclusive.
> Therefore, skip the test whether the page is dirty in this case.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  mm/mprotect.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 34c2dfb68c42..da5b9bf8204f 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>  
>  	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
>  
> -	if (pte_protnone(pte) || !pte_dirty(pte))
> +	if (pte_protnone(pte))
>  		return false;
>  
>  	/* Do we need write faults for softdirty tracking? */
> @@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>  		page = vm_normal_page(vma, addr, pte);
>  		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>  			return false;
> -	}
> +	} else if (!pte_dirty(pte))
> +		return false;
>  
>  	return true;
>  }

When I wrote that code, I was wondering how often that would actually
happen in practice -- and if we care about optimizing that. Do you have
a gut feeling in which scenarios this would happen and if we care?

If the page is in the swapcache and was swapped out, you'd be requiring
a writeback even though nobody modified the page and possibly isn't
going to do so in the near future.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable
  2022-07-20 15:19   ` David Hildenbrand
@ 2022-07-20 17:25     ` Nadav Amit
  2022-07-21  7:45       ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Nadav Amit @ 2022-07-20 17:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On Jul 20, 2022, at 8:19 AM, David Hildenbrand <david@redhat.com> wrote:

> On 18.07.22 14:02, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Anonymous pages might have the dirty bit clear, but this should not
>> prevent mprotect from making them writable if they are exclusive.
>> Therefore, skip the test whether the page is dirty in this case.
>> 
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> mm/mprotect.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>> 
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 34c2dfb68c42..da5b9bf8204f 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>> 
>> 	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
>> 
>> -	if (pte_protnone(pte) || !pte_dirty(pte))
>> +	if (pte_protnone(pte))
>> 		return false;
>> 
>> 	/* Do we need write faults for softdirty tracking? */
>> @@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>> 		page = vm_normal_page(vma, addr, pte);
>> 		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>> 			return false;
>> -	}
>> +	} else if (!pte_dirty(pte))
>> +		return false;
>> 
>> 	return true;
>> }
> 
> When I wrote that code, I was wondering how often that would actually
> happen in practice -- and if we care about optimizing that. Do you have
> a gut feeling in which scenarios this would happen and if we care?
> 
> If the page is in the swapcache and was swapped out, you'd be requiring
> a writeback even though nobody modified the page and possibly isn't
> going to do so in the near future.

So here is my due diligence: I did not really encounter a scenario in which
it showed up. When I looked at your code, I assumed this was an oversight
and not a thoughtful decision. For me the issue is more of the discrepancy
between how a certain page is handled before and after it was paged out.

The way that I see it, there is a tradeoff in the way the dirty-bit should
be handled:
(1) Writable-clean PTEs introduce some non-negligible overhead.
(2) Marking a PTE dirty speculatively would require a write back.

… But this tradeoff should not affect whether a PTE is writable, i.e.,
mapping the PTE as writable-clean should not cause a writeback. In other
words, if you are concerned about unnecessary writebacks, which I think is a
fair concern, then do not set the dirty-bit. In a later patch I try to avoid
TLB flushes on clean-writable entries that are write-protected.

So I do not think that the writeback you mentioned should be a real issue.
Yet if you think that using the fact that the page is not-dirty is a good
heuristic to avoid future TLB flushes (for P->NP; as I said there is a
solution for RW->RO), or if you are concerned about the cost of
vm_normal_page(), perhaps those are valid concerns (although I do not think
so).

--

[ Regarding (1): After some discussions with Peter and reading more code, I
thought at some point that perhaps avoiding having writable-clean PTEs as
much as possible makes sense [*], since setting the dirty-bit costs ~550
cycles and a page fault is not a lot more than 1000. But with all the
mitigations (and after adding IBRS for retbleed) page-fault entry is kind of
expensive.

[*] At least on x86 ]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20  9:42   ` David Hildenbrand
@ 2022-07-20 17:36     ` Nadav Amit
  2022-07-20 18:00       ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Nadav Amit @ 2022-07-20 17:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On Jul 20, 2022, at 2:42 AM, David Hildenbrand <david@redhat.com> wrote:

> On 18.07.22 14:01, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> When userfaultfd makes a PTE writable, it can now change the PTE
>> directly, in some cases, without going triggering a page-fault first.
>> Yet, doing so might leave the PTE that was write-unprotected as old and
>> clean. At least on x86, this would cause a >500 cycles overhead when the
>> PTE is first accessed.
>> 
>> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
>> gets a hint that the page is likely to be used. Avoid changing the PTE
>> to young and dirty in other cases to avoid excessive writeback and
>> messing with the page reclamation logic.
>> 
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> ---
>>  include/linux/mm.h | 2 ++
>>  mm/mprotect.c      | 9 ++++++++-
>>  mm/userfaultfd.c   | 8 ++++++--
>>  3 files changed, 16 insertions(+), 3 deletions(-)
>> 
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 9cc02a7e503b..4afd75ce5875 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>  /* Whether this change is for write protecting */
>>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
>>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
>> +/* Whether to try to mark entries as dirty as they are to be written */
>> +#define  MM_CP_WILL_NEED		   (1UL << 4)
>>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>  					    MM_CP_UFFD_WP_RESOLVE)
>>  
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 996a97e213ad..34c2dfb68c42 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
>>  
>>  	tlb_change_page_size(tlb, PAGE_SIZE);
>>  
>> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>  				ptent = pte_clear_uffd_wp(ptent);
>>  			}
>>  
>> +			if (will_need)
>> +				ptent = pte_mkyoung(ptent);
>> +
>>  			/*
>>  			 * In some writable, shared mappings, we might want
>>  			 * to catch actual write access -- see
>> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>  			 */
>>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>  			    !pte_write(ptent) &&
> 
> 
> Why would we want to check if we can set something writable if it
> already *is* writable? That doesn't make sense to me.

We check !pte_write(). What am I missing in your question?

Having said that, I do notice now that pte_mkdirty() should not be done
only when this condition is fulfilled. Instead we should just have
something like:

                       if (will_need) {
                               ptent = pte_mkyoung(ptent);
                               if (pte_write(ptent))
                                       ptent = pte_mkdirty(ptent);
                       }

But I do not think this answers your question, which I did not understand.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 17:36     ` Nadav Amit
@ 2022-07-20 18:00       ` David Hildenbrand
  2022-07-20 18:09         ` Nadav Amit
  0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2022-07-20 18:00 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On 20.07.22 19:36, Nadav Amit wrote:
> On Jul 20, 2022, at 2:42 AM, David Hildenbrand <david@redhat.com> wrote:
> 
>> On 18.07.22 14:01, Nadav Amit wrote:
>>> From: Nadav Amit <namit@vmware.com>
>>>
>>> When userfaultfd makes a PTE writable, it can now change the PTE
>>> directly, in some cases, without going triggering a page-fault first.
>>> Yet, doing so might leave the PTE that was write-unprotected as old and
>>> clean. At least on x86, this would cause a >500 cycles overhead when the
>>> PTE is first accessed.
>>>
>>> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
>>> gets a hint that the page is likely to be used. Avoid changing the PTE
>>> to young and dirty in other cases to avoid excessive writeback and
>>> messing with the page reclamation logic.
>>>
>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Andy Lutomirski <luto@kernel.org>
>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Will Deacon <will@kernel.org>
>>> Cc: Yu Zhao <yuzhao@google.com>
>>> Cc: Nick Piggin <npiggin@gmail.com>
>>> ---
>>>  include/linux/mm.h | 2 ++
>>>  mm/mprotect.c      | 9 ++++++++-
>>>  mm/userfaultfd.c   | 8 ++++++--
>>>  3 files changed, 16 insertions(+), 3 deletions(-)
>>> 
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 9cc02a7e503b..4afd75ce5875 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>>  /* Whether this change is for write protecting */
>>>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
>>>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
>>> +/* Whether to try to mark entries as dirty as they are to be written */
>>> +#define  MM_CP_WILL_NEED		   (1UL << 4)
>>>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>>  					    MM_CP_UFFD_WP_RESOLVE)
>>>  
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 996a97e213ad..34c2dfb68c42 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>>> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
>>>  
>>>  	tlb_change_page_size(tlb, PAGE_SIZE);
>>>  
>>> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>>  				ptent = pte_clear_uffd_wp(ptent);
>>>  			}
>>>  
>>> +			if (will_need)
>>> +				ptent = pte_mkyoung(ptent);
>>> +
>>>  			/*
>>>  			 * In some writable, shared mappings, we might want
>>>  			 * to catch actual write access -- see
>>> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>>  			 */
>>>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>>  			    !pte_write(ptent) &&
>>
>>
>> Why would we want to check if we can set something writable if it
>> already *is* writable? That doesn't make sense to me.
> 
> We check !pte_write(). What am I missing in your question?

My patch review skills have seen better days. I thought you'd be
removing the pte_write() check ... :( Tired eyes ...

> 
> Having said that, I do notice now that pte_mkdirty() should not be done
> only this condition is fulfilled. Instead we should just have
> something like:
> 
>                        if (will_need) {
>                                ptent = pte_mkyoung(ptent);
>                                if (pte_write(ptent))
>                                        ptent = pte_mkdirty(ptent);
>                        }

As can_change_pte_writable() will fail if it stumbles over a !pte_dirty
page in current code ... so I assume you would have that code before the
actual pte_mkwrite() logic, correct?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 18:00       ` David Hildenbrand
@ 2022-07-20 18:09         ` Nadav Amit
  2022-07-20 18:11           ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Nadav Amit @ 2022-07-20 18:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On Jul 20, 2022, at 11:00 AM, David Hildenbrand <david@redhat.com> wrote:

> My patch review skills have seen better days. I thought you'd be
> removing the pte_write() check ... :( Tired eyes ...
> 
>> Having said that, I do notice now that pte_mkdirty() should not be done
>> only when this condition is fulfilled. Instead we should just have
>> something like:
>> 
>> if (will_need) {
>> 	ptent = pte_mkyoung(ptent);
>> 	if (pte_write(ptent))
>> 		ptent = pte_mkdirty(ptent);
>> }
> 
> As can_change_pte_writable() will fail if it stumbles over a !pte_dirty
> page in current code ... so I assume you would have that code before the
> actual pte_mkwrite() logic, correct?

No, I thought this should go after, for 2 reasons:

1. You want to allow the PTE to be made writable following the
can_change_pte_writable().

2. You do not want to set a non-writable PTE as dirty, especially since it
might then be used to determine that the PTE can become writable. Doing so
would circumvent cases in which set_page_dirty() needs to be called and
break things.
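
Concretely, the ordering I have in mind -- just a sketch on top of the
change_pte_range() snippet above, not tested:

			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
			    !pte_write(ptent) &&
			    can_change_pte_writable(vma, addr, ptent))
				ptent = pte_mkwrite(ptent);

			/* only after the write-bit decision: */
			if (will_need) {
				ptent = pte_mkyoung(ptent);
				if (pte_write(ptent))
					ptent = pte_mkdirty(ptent);
			}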



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 18:09         ` Nadav Amit
@ 2022-07-20 18:11           ` David Hildenbrand
  0 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2022-07-20 18:11 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On 20.07.22 20:09, Nadav Amit wrote:
> On Jul 20, 2022, at 11:00 AM, David Hildenbrand <david@redhat.com> wrote:
> 
>> My patch review skills have seen better days. I thought you'd be
>> removing the pte_write() check ... :( Tired eyes ...
>>
>>> Having said that, I do notice now that pte_mkdirty() should not be done
>>> only when this condition is fulfilled. Instead we should just have
>>> something like:
>>>
>>> if (will_need) {
>>> 	ptent = pte_mkyoung(ptent);
>>> 	if (pte_write(ptent))
>>> 		ptent = pte_mkdirty(ptent);
>>> }
>>
>> As can_change_pte_writable() will fail if it stumbles over a !pte_dirty
>> page in current code ... so I assume you would have that code before the
>> actual pte_mkwrite() logic, correct?
> 
> No, I thought this should go after, for 2 reasons:
> 
> 1. You want to allow the PTE to be made writable following the
> can_change_pte_writable().
> 
> 2. You do not want to set a non-writable PTE as dirty, especially since it
> might then be used to determine that the PTE can become writable. Doing so
> would circumvent cases in which set_page_dirty() needs to be called and
> break things.

Then I'm confused how can_change_pte_writable() would ever allow for
that. Best to show me the code :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 15:10         ` David Hildenbrand
@ 2022-07-20 19:15           ` Peter Xu
  2022-07-20 19:33             ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2022-07-20 19:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
> For pagecache pages it may as well be *plain wrong* to bypass the write
> fault handler and simply mark pages dirty+map them writable.

Could you elaborate?

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 19:15           ` Peter Xu
@ 2022-07-20 19:33             ` David Hildenbrand
  2022-07-20 19:48               ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2022-07-20 19:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 20.07.22 21:15, Peter Xu wrote:
> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>> For pagecache pages it may as well be *plain wrong* to bypass the write
>> fault handler and simply mark pages dirty+map them writable.
> 
> Could you elaborate?

Write-fault handling for some filesystems (that even require this
"slow path") is a bit special.

For example, do_shared_fault() might have to call page_mkwrite().

AFAIK file systems use that for lazy allocation of disk blocks.
If you simply go ahead and map a !dirty pagecache page writable
and mark it dirty, it will not trigger page_mkwrite() and you might
end up corrupting data.

That's why the old change_pte_range() code never touched
anything if the pte wasn't already dirty. Because as long as it's not writable,
the FS might have to be informed about the write-unprotect.

And we end up in the case here for VM_SHARED with vma_wants_writenotify().
Where we, for example, check

/* The backer wishes to know when pages are first written to? */
if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
	return 1;


Long story short, we should be really careful with write-fault handler bypasses,
especially when deciding to set dirty bits. For pagecache pages, we have to be
especially careful.

For exclusive anon pages it's mostly ok, because wp_page_reuse()
doesn't really contain that much magic. The only thing to consider for ordinary
mprotect() is that there is -- IMHO -- absolutely no guarantee that someone will
write to a specific write-unprotected page soon. For uffd-wp-unprotect it might be
easier to guess, especially if we un-protect only a single page.
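
(Roughly the path I mean -- a heavily simplified sketch of do_shared_fault()
from mm/memory.c, locking and error handling elided:)

static vm_fault_t do_shared_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	vm_fault_t ret;

	ret = __do_fault(vmf);

	/*
	 * Check if the backing address space wants to know that the
	 * page is about to become writable.
	 */
	if (vma->vm_ops->page_mkwrite) {
		unlock_page(vmf->page);
		/* the FS may, e.g., allocate disk blocks here */
		ret = do_page_mkwrite(vmf);
	}

	ret |= finish_fault(vmf);
	return ret;
}

Bypassing that path means the FS never learns about the write.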

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 19:33             ` David Hildenbrand
@ 2022-07-20 19:48               ` Peter Xu
  2022-07-20 19:55                 ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2022-07-20 19:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
> On 20.07.22 21:15, Peter Xu wrote:
> > On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
> >> For pagecache pages it may as well be *plain wrong* to bypass the write
> >> fault handler and simply mark pages dirty+map them writable.
> > 
> > Could you elaborate?
> 
> Write-fault handling for some filesystems (that even require this
> "slow path") is a bit special.
> 
> For example, do_shared_fault() might have to call page_mkwrite().
> 
> AFAIK file systems use that for lazy allocation of disk blocks.
> If you simply go ahead and map a !dirty pagecache page writable
> and mark it dirty, it will not trigger page_mkwrite() and you might
> end up corrupting data.
> 
> That's why the old change_pte_range() code never touched
> anything if the pte wasn't already dirty.

I don't think that pte_dirty() check was for the pagecache code. For any fs
that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
return 1, so we'll never try to add write bit, hence we'll never even try
to check pte_dirty().

> Because as long as it's not writable,
> the FS might have to be informed about the write-unprotect.
> 
> And we end up in the case here for VM_SHARED with vma_wants_writenotify().
> Where we, for example, check
> 
> /* The backer wishes to know when pages are first written to? */
> if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
> 	return 1;
> 
> 
> Long story short, we should be really careful with write-fault handler bypasses,
> especially when deciding to set dirty bits. For pagecache pages, we have to be
> especially careful.

Since you mentioned page_mkwrite, IMHO it's really the write bit, not the
dirty bit, that matters here (and IMHO that's why it's called page_mkwrite(),
not page_mkdirty()).  Here Nadav's patch added pte_mkdirty() only if
pte_mkwrite() happens.  So I'm a bit confused about what your worry is, and
what you're against doing.

Say, even with my original proposal to set dirty unconditionally, it'll
still be after the pte_mkwrite().  I don't see how that could affect
page_mkwrite(), not to mention that it'll not even reach there.

> 
> For exclusive anon pages it's mostly ok, because wp_page_reuse()
> doesn't really contain that much magic. The only thing to consider for ordinary
> mprotect() is that there is -- IMHO -- absolutely no guarantee that someone will
> write to a specific write-unprotected page soon. For uffd-wp-unprotect it might be
> easier to guess, especially if we un-protect only a single page.

Yeah, as mentioned I think that's a valid point - looks good to me to attach
the dirty bit only when given a hint.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 19:48               ` Peter Xu
@ 2022-07-20 19:55                 ` David Hildenbrand
  2022-07-20 20:22                   ` Nadav Amit
  0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2022-07-20 19:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 20.07.22 21:48, Peter Xu wrote:
> On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
>> On 20.07.22 21:15, Peter Xu wrote:
>>> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>>>> For pagecache pages it may as well be *plain wrong* to bypass the write
>>>> fault handler and simply mark pages dirty+map them writable.
>>>
>>> Could you elaborate?
>>
>> Write-fault handling for some filesystems (that even require this
>> "slow path") is a bit special.
>>
>> For example, do_shared_fault() might have to call page_mkwrite().
>>
>> AFAIK file systems use that for lazy allocation of disk blocks.
>> If you simply go ahead and map a !dirty pagecache page writable
>> and mark it dirty, it will not trigger page_mkwrite() and you might
>> end up corrupting data.
>>
>> That's why the old change_pte_range() code never touched
>> anything if the pte wasn't already dirty.
> 
> I don't think that pte_dirty() check was for the pagecache code. For any fs
> that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
> return 1, so we'll never try to add write bit, hence we'll never even try
> to check pte_dirty().
> 

I might be too tired, but the whole reason we had this magic in place
before my commit was only for the pagecache.

With vma_wants_writenotify()=0 you can directly map the pages writable
and don't have to do these advanced checks here. In a writable
MAP_SHARED VMA you'll already have pte_write().

We only get !pte_write() in case we have vma_wants_writenotify()=1 ...

  try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);

and that's the code that checked the dirty bit after all to decide --
amongst other things -- if we can simply map it writable without going
via the write fault handler and triggering do_shared_fault() .

See crazy/ugly FOLL_FORCE code in GUP that similarly checks the dirty bit.
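
(The check I mean, roughly as it looks in mm/gup.c in this era -- simplified:)

/*
 * FOLL_FORCE can write to even unwritable pte's, but only
 * after we've gone through a COW cycle and they are dirty.
 */
static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
{
	return pte_write(pte) ||
		((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
		 pte_dirty(pte));
}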

But yeah, it's all confusing so I might just be wrong regarding
pagecache pages.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 19:55                 ` David Hildenbrand
@ 2022-07-20 20:22                   ` Nadav Amit
  2022-07-20 20:38                     ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Nadav Amit @ 2022-07-20 20:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

On Jul 20, 2022, at 12:55 PM, David Hildenbrand <david@redhat.com> wrote:

> On 20.07.22 21:48, Peter Xu wrote:
>> On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
>>> On 20.07.22 21:15, Peter Xu wrote:
>>>> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>>>>> For pagecache pages it may as well be *plain wrong* to bypass the write
>>>>> fault handler and simply mark pages dirty+map them writable.
>>>> 
>>>> Could you elaborate?
>>> 
>>> Write-fault handling for some filesystems (that even require this
>>> "slow path") is a bit special.
>>> 
>>> For example, do_shared_fault() might have to call page_mkwrite().
>>> 
>>> AFAIK file systems use that for lazy allocation of disk blocks.
>>> If you simply go ahead and map a !dirty pagecache page writable
>>> and mark it dirty, it will not trigger page_mkwrite() and you might
>>> end up corrupting data.
>>> 
>>> That's why the old change_pte_range() code never touched
>>> anything if the pte wasn't already dirty.
>> 
>> I don't think that pte_dirty() check was for the pagecache code. For any fs
>> that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
>> return 1, so we'll never try to add write bit, hence we'll never even try
>> to check pte_dirty().
> 
> I might be too tired, but the whole reason we had this magic in place
> before my commit was only for the pagecache.
> 
> With vma_wants_writenotify()=0 you can directly map the pages writable
> and don't have to do these advanced checks here. In a writable
> MAP_SHARED VMA you'll already have pte_write().
> 
> We only get !pte_write() in case we have vma_wants_writenotify()=1 ...
> 
>  try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
> 
> and that's the code that checked the dirty bit after all to decide --
> amongst other things -- if we can simply map it writable without going
> via the write fault handler and triggering do_shared_fault() .
> 
> See crazy/ugly FOLL_FORCE code in GUP that similarly checks the dirty bit.

I thought you wanted to get rid of it, at least for anonymous pages. No?

> 
> But yeah, it's all confusing so I might just be wrong regarding
> pagecache pages.

Just to note: I am not very courageous and I did not intend to change the
condition for when non-anonymous pages are set as writable. That’s the
reason I did not change the dirty bit for non-writable non-anonymous entries
(as Peter said). And that’s the reason that setting the dirty bit (at least
as I should have done it) is only performed after we made the decision on
the write-bit.

IOW, after you made your decision about the write-bit, then and only then
you may be able to set the dirty bit for writable entries. Since the entry
is already writeable (i.e., can be written without a fault later directly
from userspace), there should be no concern of correctness when you set it.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 20:22                   ` Nadav Amit
@ 2022-07-20 20:38                     ` David Hildenbrand
  2022-07-20 20:56                       ` Nadav Amit
  0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2022-07-20 20:38 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

On 20.07.22 22:22, Nadav Amit wrote:
> On Jul 20, 2022, at 12:55 PM, David Hildenbrand <david@redhat.com> wrote:
> 
>> On 20.07.22 21:48, Peter Xu wrote:
>>> On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
>>>> On 20.07.22 21:15, Peter Xu wrote:
>>>>> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>>>>>> For pagecache pages it may as well be *plain wrong* to bypass the write
>>>>>> fault handler and simply mark pages dirty+map them writable.
>>>>>
>>>>> Could you elaborate?
>>>>
>>>> Write-fault handling for some filesystems (that even require this
>>>> "slow path") is a bit special.
>>>>
>>>> For example, do_shared_fault() might have to call page_mkwrite().
>>>>
>>>> AFAIK file systems use that for lazy allocation of disk blocks.
>>>> If you simply go ahead and map a !dirty pagecache page writable
>>>> and mark it dirty, it will not trigger page_mkwrite() and you might
>>>> end up corrupting data.
>>>>
>>>> That's why the old change_pte_range() code never touched
>>>> anything if the pte wasn't already dirty.
>>>
>>> I don't think that pte_dirty() check was for the pagecache code. For any fs
>>> that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
>>> return 1, so we'll never try to add write bit, hence we'll never even try
>>> to check pte_dirty().
>>
>> I might be too tired, but the whole reason we had this magic in place
>> before my commit was only for the pagecache.
>>
>> With vma_wants_writenotify()=0 you can directly map the pages writable
>> and don't have to do these advanced checks here. In a writable
>> MAP_SHARED VMA you'll already have pte_write().
>>
>> We only get !pte_write() in case we have vma_wants_writenotify()=1 ...
>>
>>  try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
>>
>> and that's the code that checked the dirty bit after all to decide --
>> amongst other things -- if we can simply map it writable without going
>> via the write fault handler and triggering do_shared_fault() .
>>
>> See crazy/ugly FOLL_FORCE code in GUP that similarly checks the dirty bit.
> 
> I thought you wanted to get rid of it, at least for anonymous pages. No?

Yes. Especially for any MAP_PRIVATE mappings.

If you want to write to something that's not mapped writable in a
MAP_PRIVATE mapping, it
a) Has to be an exclusive anonymous page
b) The pte has to be dirty

In any other case, you clearly missed the COW, or the modifications might
get lost if the PTE is not dirty.
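
Spelled out as a sketch (helpers as in current mm code):

	/* force-write to a !pte_write() pte in a MAP_PRIVATE mapping */
	struct page *page = vm_normal_page(vma, addr, pte);
	bool may_write = page && PageAnon(page) &&
			 PageAnonExclusive(page) &&	/* a) */
			 pte_dirty(pte);		/* b) */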

MAP_SHARED is a bit more involved. Whether the pte is dirty might be
good enough ... but this needs a lot more care.

> 
>>
>> But yeah, it's all confusing so I might just be wrong regarding
>> pagecache pages.
> 
> Just to note: I am not very courageous and I did not intend to change the
> condition for when non-anonymous pages are set as writable. That’s the
> reason I did not change the dirty bit for non-writable non-anonymous entries
> (as Peter said). And that’s the reason that setting the dirty bit (at least
> as I should have done it) is only performed after we made the decision on
> the write-bit.

Good. As long as we stick to anonymous pages I roughly know what we
can and cannot do at this point :)


The problem I see is that detecting whether we can write requires the
dirty bit ... and deciding whether to set the dirty bit requires knowing
whether we can write.

Again, for anonymous pages we should be able to relax the "dirty"
requirement when detecting whether we can write.

> 
> IOW, after you made your decision about the write-bit, then and only then
> you may be able to set the dirty bit for writable entries. Since the entry
> is already writeable (i.e., can be written without a fault later directly
> from userspace), there should be no concern of correctness when you set it.

That makes sense to me. What keeps confusing me are architectures with
and without a hw-managed dirty bit ... :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 20:38                     ` David Hildenbrand
@ 2022-07-20 20:56                       ` Nadav Amit
  2022-07-21  7:52                         ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Nadav Amit @ 2022-07-20 20:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

On Jul 20, 2022, at 1:38 PM, David Hildenbrand <david@redhat.com> wrote:

> On 20.07.22 22:22, Nadav Amit wrote:
>> On Jul 20, 2022, at 12:55 PM, David Hildenbrand <david@redhat.com> wrote:
>> 
>>> On 20.07.22 21:48, Peter Xu wrote:
>>>> On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
>>>>> On 20.07.22 21:15, Peter Xu wrote:
>>>>>> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>>>>>>> For pagecache pages it may as well be *plain wrong* to bypass the write
>>>>>>> fault handler and simply mark pages dirty+map them writable.
>>>>>> 
>>>>>> Could you elaborate?
>>>>> 
>>>>> Write-fault handling for some filesystems (that even require this
>>>>> "slow path") is a bit special.
>>>>> 
>>>>> For example, do_shared_fault() might have to call page_mkwrite().
>>>>> 
>>>>> AFAIK file systems use that for lazy allocation of disk blocks.
>>>>> If you simply go ahead and map a !dirty pagecache page writable
>>>>> and mark it dirty, it will not trigger page_mkwrite() and you might
>>>>> end up corrupting data.
>>>>> 
>>>>> That's why the old change_pte_range() code never touched
>>>>> anything if the pte wasn't already dirty.
>>>> 
>>>> I don't think that pte_dirty() check was for the pagecache code. For any fs
>>>> that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
>>>> return 1, so we'll never try to add write bit, hence we'll never even try
>>>> to check pte_dirty().
>>> 
>>> I might be too tired, but the whole reason we had this magic in place
>>> before my commit was only for the pagecache.
>>> 
>>> With vma_wants_writenotify()=0 you can directly map the pages writable
>>> and don't have to do these advanced checks here. In a writable
>>> MAP_SHARED VMA you'll already have pte_write().
>>> 
>>> We only get !pte_write() in case we have vma_wants_writenotify()=1 ...
>>> 
>>> try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
>>> 
>>> and that's the code that checked the dirty bit after all to decide --
>>> amongst other things -- if we can simply map it writable without going
>>> via the write fault handler and triggering do_shared_fault() .
>>> 
>>> See crazy/ugly FOLL_FORCE code in GUP that similarly checks the dirty bit.
>> 
>> I thought you wanted to get rid of it, at least for anonymous pages. No?
> 
> Yes. Especially for any MAP_PRIVATE mappings.
> 
> If you want to write to something that's not mapped writable in a
> MAP_PRIVATE mapping, it
> a) Has to be an exclusive anonymous page
> b) The pte has to be dirty

Do you need both conditions to be true? I thought (a) is sufficient (if
the soft-dirty and related checks succeed).

> 
> In any other case, you clearly missed the COW, or the modifications might
> get lost if the PTE is not dirty.
> 
> MAP_SHARED is a bit more involved. Whether the pte is dirty might be
> good enough ... but this needs a lot more care.
> 
>>> But yeah, it's all confusing so I might just be wrong regarding
>>> pagecache pages.
>> 
>> Just to note: I am not very courageous and I did not intend to change the
>> condition for when non-anonymous pages are set as writable. That’s the
>> reason I did not change the dirty bit for non-writable non-anonymous entries
>> (as Peter said). And that’s the reason that setting the dirty bit (at least
>> as I should have done it) is only performed after we made the decision on
>> the write-bit.
> 
> Good. As long as we stick to anonymous pages I roughly know what we
> can and cannot do at this point :)
> 
> 
> The problem I see is that detecting whether we can write requires the
> dirty bit ... and deciding whether to set the dirty bit requires knowing
> whether we can write.
> 
> Again, for anonymous pages we should be able to relax the "dirty"
> requirement when detecting whether we can write.

That’s all I wanted to do there.

> 
>> IOW, after you made your decision about the write-bit, then and only then
>> you may be able to set the dirty bit for writable entries. Since the entry
>> is already writeable (i.e., can be written without a fault later directly
>> from userspace), there should be no concern of correctness when you set it.
> 
> That makes sense to me. What keeps confusing me are architectures with
> and without a hw-managed dirty bit ... :)

I don’t know which arch you have in mind. But the moment a PTE is
writable, marking it logically/architecturally as dirty should be
fine.

But… if the Exclusive check is not good enough for private+anon without
the “logical” dirty bit, then there would be a problem. 




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable
  2022-07-20 17:25     ` Nadav Amit
@ 2022-07-21  7:45       ` David Hildenbrand
  0 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2022-07-21  7:45 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On 20.07.22 19:25, Nadav Amit wrote:
> On Jul 20, 2022, at 8:19 AM, David Hildenbrand <david@redhat.com> wrote:
> 
>> On 18.07.22 14:02, Nadav Amit wrote:
>>> From: Nadav Amit <namit@vmware.com>
>>>
>>> Anonymous pages might have the dirty bit clear, but this should not
>>> prevent mprotect from making them writable if they are exclusive.
>>> Therefore, skip the test whether the page is dirty in this case.
>>>
>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Andy Lutomirski <luto@kernel.org>
>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Will Deacon <will@kernel.org>
>>> Cc: Yu Zhao <yuzhao@google.com>
>>> Cc: Nick Piggin <npiggin@gmail.com>
>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>> ---
>>> mm/mprotect.c | 5 +++--
>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 34c2dfb68c42..da5b9bf8204f 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>>>
>>> 	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
>>>
>>> -	if (pte_protnone(pte) || !pte_dirty(pte))
>>> +	if (pte_protnone(pte))
>>> 		return false;
>>>
>>> 	/* Do we need write faults for softdirty tracking? */
>>> @@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>>> 		page = vm_normal_page(vma, addr, pte);
>>> 		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>>> 			return false;
>>> -	}
>>> +	} else if (!pte_dirty(pte))
>>> +		return false;
>>>
>>> 	return true;
>>> }
>>
>> When I wrote that code, I was wondering how often that would actually
>> happen in practice -- and if we care about optimizing that. Do you have
>> a gut feeling in which scenarios this would happen and if we care?
>>
>> If the page is in the swapcache and was swapped out, you'd be requiring
>> a writeback even though nobody modified the page and possibly isn't
>> going to do so in the near future.
> 
> So here is my due diligence: I did not really encounter a scenario in which
> it showed up. When I looked at your code, I assumed this was an oversight
> and not a thoughtful decision. For me the issue is more of the discrepancy
> between how a certain page is handled before and after it was paged out.
> 
> The way that I see it, there is a tradeoff in the way the dirty-bit should
> be handled:
> (1) Writable-clean PTEs introduce some non-negligible overhead.
> (2) Marking a PTE dirty speculatively would require a write back.
> 
> … But this tradeoff should not affect whether a PTE is writable, i.e.,
> mapping the PTE as writable-clean should not cause a writeback. In other
> words, if you are concerned about unnecessary writebacks, which I think is a
> fair concern, then do not set the dirty-bit. In a later patch I try to avoid
> TLB flushes on clean-writable entries that are write-protected.
> 
> So I do not think that the writeback you mentioned should be a real issue.
> Yet if you think that the page being non-dirty is a good heuristic for
> avoiding future TLB flushes (for P->NP; as I said, there is a solution for
> RW->RO), or if you are concerned about the cost of vm_normal_page(),
> perhaps those are valid concerns (although I do not think so).

I think I now understand what you mean. I somehow remembered that some
architectures set a PTE dirty when marking it writable, but I guess this
is not true -- marking it writable will keep it !dirty until it is
actually written (either the HW sets the bit or we take a fault). [I'll
do some more digging just to confirm]

With that in mind, your patch makes sense, and I guess you'd want that
as patch #1, because otherwise I fail to see how current patch #1 would
even succeed in reaching the "pte_mkdirty" call -- if !pte_dirty()
protects the code from running.
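
For the record, here is roughly what can_change_pte_writable() would look
like with your patch applied -- reconstructed from the diff above plus my
memory of the surrounding context, so treat the elided checks in the
middle as approximate:

static inline bool can_change_pte_writable(struct vm_area_struct *vma,
					   unsigned long addr, pte_t pte)
{
	struct page *page;

	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));

	/* NUMA-hinting PTEs still have to take a fault. */
	if (pte_protnone(pte))
		return false;

	/* Softdirty and uffd-wp tracking still require write faults. */
	if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
		return false;
	if (userfaultfd_pte_wp(vma, pte))
		return false;

	if (!(vma->vm_flags & VM_SHARED)) {
		/*
		 * MAP_PRIVATE: an exclusive anonymous page can now be
		 * mapped writable even when the PTE is clean.
		 */
		page = vm_normal_page(vma, addr, pte);
		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
			return false;
	} else if (!pte_dirty(pte)) {
		/* MAP_SHARED: keep requiring a dirty PTE, as before. */
		return false;
	}

	return true;
}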

> 
> --
> 
> [ Regarding (1): After some discussions with Peter and reading more code, I
> thought at some point that perhaps avoiding writable-clean PTEs as much as
> possible makes sense [*], since setting the dirty-bit costs ~550 cycles and
> a page fault is not a lot more than 1000. But with all the mitigations (and
> after adding IBRS for retbleed), page-fault entry is quite expensive. 

I understand the reasoning for anonymous memory, but not for page cache
pages. And for anonymous memory I think there are still cases where we
don't want to do that (swapcache and MADV_FREE come to mind) -- and IMHO,
in some of the scenarios where we have clean anonymous memory at all, we
just don't care about optimizing for setting the dirty bit faster on
x86 ;) .

So my gut feeling is to keep it as simple as possible. To me, that
translates to setting the dirty bit only if we have a clear indication
that write access is likely next.
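
In code, I am thinking of something as simple as the following, placed
only after the write-bit decision was made (illustration only, reusing
the MM_CP_WILL_NEED hint from patch #1):

	/*
	 * Only dirty the PTE speculatively when it ended up writable
	 * and the caller explicitly hinted that a write is imminent.
	 */
	if ((cp_flags & MM_CP_WILL_NEED) && pte_write(ptent))
		ptent = pte_mkdirty(ptent);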

Thanks Nadav for refreshing my memory :) !

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 20:56                       ` Nadav Amit
@ 2022-07-21  7:52                         ` David Hildenbrand
  2022-07-21 14:10                           ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2022-07-21  7:52 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

>> Yes. Especially for any MAP_PRIVATE mappings.
>>
>> If you want to write to something that's not mapped writable in a
>> MAP_PRIVATE mapping it
>> a) Has to be an exclusive anonymous page
>> b) The pte has to be dirty
> 
> Do you need both conditions to be true? I thought (a) is sufficient (if
> the soft-dirty and related checks succeed).

If we force-write to a page, we need the PTE to be dirty to tell the
reclaim code that the content is stale (i.e., was modified). We can
either mark the PTE dirty manually, or just let the write-fault handler
deal with it to simplify the GUP code. This needs some more thought, but
that's my understanding.
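
To spell out what such a check could look like on the GUP side, here is a
purely hypothetical sketch (the name and structure mirror
can_change_pte_writable(); this is not what is upstream today, and it
leaves open who marks the PTE dirty):

static bool can_follow_write_pte(pte_t pte, struct page *page,
				 struct vm_area_struct *vma,
				 unsigned int flags)
{
	/* A writable PTE can be written to, no questions asked. */
	if (pte_write(pte))
		return true;

	/* FOLL_FORCE only applies to private mappings lacking VM_WRITE. */
	if (!(flags & FOLL_FORCE) || (vma->vm_flags & VM_SHARED))
		return false;

	/* We may only skip the COW for exclusive anonymous pages ... */
	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
		return false;

	/* ... when no write fault is required for softdirty/uffd-wp. */
	if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
		return false;
	return !userfaultfd_pte_wp(vma, pte);
}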

> 
>>
>> In any other case, you clearly missed a COW, or the modifications might
>> get lost if the PTE is not dirty.
>>
>> MAP_SHARED is a bit more involved. Whether the pte is dirty might be
>> good enough ... but this needs a lot more care.
>>
>>>> But yeah, it's all confusing so I might just be wrong regarding
>>>> pagecache pages.
>>>
>>> Just to note: I am not very courageous and I did not intend to change the
>>> condition for when non-anonymous pages are set as writable. That’s the
>>> reason I did not change the dirty-bit for non-writable non-anonymous
>>> entries (as Peter said). And that’s the reason that setting the dirty bit
>>> (at least as I should have done it) is only performed after we made the
>>> decision on the write-bit.
>>
>> Good. As long as we stick to anonymous pages I roughly know what we
>> can and cannot do at this point :)
>>
>>
>> The problem I see is that detecting whether we can write requires the
>> dirty bit ... and whether to set the dirty bit depends on whether we
>> can write.
>>
>> Again, for anonymous pages we should be able to relax the "dirty"
>> requirement when detecting whether we can write.
> 
> That’s all I wanted to do there.
> 
>>
>>> IOW, after you made your decision about the write-bit, then and only then
>>> you may be able to set the dirty bit for writable entries. Since the entry
>>> is already writeable (i.e., can be written without a fault later directly
>>> from userspace), there should be no concern of correctness when you set it.
>>
>> That makes sense to me. What keeps confusing me are architectures with
>> and without a hw-managed dirty bit ... :)
> 
> I don’t know which arch you have in your mind. But the moment a PTE is
> writable, marking it logically/architecturally as dirty should be
> fine.
> 
> But… if the Exclusive check is not good enough for private+anon without
> the “logical” dirty bit, then there would be a problem. 

I think we are good for anonymous pages.

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-21  7:52                         ` David Hildenbrand
@ 2022-07-21 14:10                           ` David Hildenbrand
  0 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2022-07-21 14:10 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

On 21.07.22 09:52, David Hildenbrand wrote:
>>> Yes. Especially for any MAP_PRIVATE mappings.
>>>
>>> If you want to write to something that's not mapped writable in a
>>> MAP_PRIVATE mapping it
>>> a) Has to be an exclusive anonymous page
>>> b) The pte has to be dirty
>>
>> Do you need both conditions to be true? I thought (a) is sufficient (if
>> the soft-dirty and related checks succeed).
> 
> If we force-write to a page, we need the PTE to be dirty to tell the
> reclaim code that the content is stale (i.e., was modified). We can
> either mark the PTE dirty manually, or just let the write-fault handler
> deal with it to simplify the GUP code. This needs some more thought, but
> that's my understanding.

Expanding on my previous answer after staring at the code:

a) I have to dig if the FOLL_FORCE special-retry-handling is required
for MAP_SHARED at all.

check_vma_flags() allows FOLL_FORCE only on MAP_PRIVATE VMAs that lack
VM_WRITE.

Consequently, I would have assumed that the first write fault should be
sufficient on a MAP_SHARED VMA to actually map the PTE writable and not
require any of that special retry magic.
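
For reference, the check I mean is roughly the following (paraphrasing
mm/gup.c from memory, so double-check the exact conditions):

	/* In check_vma_flags(), for a write access: */
	if (write) {
		if (!(vm_flags & VM_WRITE)) {
			if (!(gup_flags & FOLL_FORCE))
				return -EFAULT;
			/*
			 * FOLL_FORCE can only override the protection of
			 * COW-able mappings, i.e., MAP_PRIVATE with
			 * VM_MAYWRITE.
			 */
			if (!is_cow_mapping(vm_flags))
				return -EFAULT;
		}
	}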


b) I wonder if we have to take care of uffd-wp and softdirty (just like
in mprotect code here) as well in case we stumble over an exclusive
anonymous page. Yes, the VMA is currently not writable, but I'd have
expected at least softdirty tracking to apply.

... I'll dig into the details.

-- 
Thanks,

David / dhildenb




end of thread

Thread overview: 33+ messages
     [not found] <20220718120212.3180-1-namit@vmware.com>
2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
2022-07-19 20:47   ` Peter Xu
2022-07-20  9:39     ` David Hildenbrand
2022-07-20 13:10       ` Peter Xu
2022-07-20 15:10         ` David Hildenbrand
2022-07-20 19:15           ` Peter Xu
2022-07-20 19:33             ` David Hildenbrand
2022-07-20 19:48               ` Peter Xu
2022-07-20 19:55                 ` David Hildenbrand
2022-07-20 20:22                   ` Nadav Amit
2022-07-20 20:38                     ` David Hildenbrand
2022-07-20 20:56                       ` Nadav Amit
2022-07-21  7:52                         ` David Hildenbrand
2022-07-21 14:10                           ` David Hildenbrand
2022-07-20  9:42   ` David Hildenbrand
2022-07-20 17:36     ` Nadav Amit
2022-07-20 18:00       ` David Hildenbrand
2022-07-20 18:09         ` Nadav Amit
2022-07-20 18:11           ` David Hildenbrand
2022-07-18 12:02 ` [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages Nadav Amit
2022-07-19 20:49   ` Peter Xu
2022-07-18 12:02 ` [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable Nadav Amit
2022-07-20 15:19   ` David Hildenbrand
2022-07-20 17:25     ` Nadav Amit
2022-07-21  7:45       ` David Hildenbrand
2022-07-18 12:02 ` [RFC PATCH 04/14] mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 06/14] mm/rmap: avoid flushing on page_vma_mkclean_one() when possible Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 07/14] mm: do fix spurious page-faults for instruction faults Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 08/14] x86/mm: introduce flush_tlb_fix_spurious_fault Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 10/14] x86/mm: introduce relaxed TLB flushes Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 11/14] x86/mm: use relaxed TLB flushes when protection is removed Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 12/14] x86/tlb: no flush on PTE change from RW->RO when PTE is clean Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 14/14] mm: conditional check of pfn in pte_flush_type Nadav Amit
