* [PATCH v3 0/5] implement lightweight guard pages
@ 2024-10-23 16:24 Lorenzo Stoakes
  2024-10-23 16:24 ` [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs Lorenzo Stoakes
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 16:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

Userland library functions such as allocators and threading implementations
often require regions of memory to act as 'guard pages' - mappings which,
when accessed, result in a fatal signal being sent to the accessing
process.

The current means by which these are implemented is via a PROT_NONE mmap()
mapping, which provides the required semantics but incurs the overhead of a
VMA for each such region.

With a great many processes and threads, this can rapidly add up and incur
a significant memory penalty. It also has the added problem of preventing
merges that might otherwise be permitted.

This series takes a different approach - an idea suggested by Vlastimil
Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
provenance becomes a little tricky to ascertain after this - please forgive
any omissions!) - rather than locating the guard pages at the VMA layer, we
instead place them in the page tables mapping the required ranges.

Early testing of the prototype version of this code suggests a 5 times
speed up in memory mapping invocations (in conjunction with use of
process_madvise()) and a 13% reduction in VMAs on an entirely idle Android
system with unoptimised code.

We expect that, with optimisation, and on a loaded system with a larger
number of guard pages, these improvements would increase significantly, but
in any case these numbers are encouraging.

This way, rather than having separate VMAs specifying which parts of a
range are guard pages, we have a single VMA spanning the entire range of
memory a user is permitted to access, including the ranges which are to be
'guarded'.

After mapping this, a user can specify which parts of the range should
result in a fatal signal when accessed.

By restricting the ability to specify guard pages to memory mapped by
existing VMAs, we can rely on the guard markers being torn down when the
mappings are ultimately unmapped; from the point of view of the containing
VMAs, everything behaves exactly as if the memory were simply not faulted
in.

This mechanism in effect poisons memory ranges similar to hardware memory
poisoning, only it is an entirely software-controlled form of poisoning.

The mechanism is implemented via madvise() behaviour - MADV_GUARD_INSTALL
which installs page table-level guard page markers - and
MADV_GUARD_REMOVE - which clears them.
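
For illustration, a minimal userland sketch of the intended usage (the
MADV_GUARD_* values from this series are defined locally in case installed
headers lack them; error handling is elided):

  #include <sys/mman.h>
  #include <unistd.h>

  #ifndef MADV_GUARD_INSTALL
  #define MADV_GUARD_INSTALL 102	/* values from this series */
  #define MADV_GUARD_REMOVE  103
  #endif

  int main(void)
  {
          long page_size = sysconf(_SC_PAGESIZE);
          /* A single ordinary VMA spanning 16 pages. */
          char *buf = mmap(NULL, 16 * page_size, PROT_READ | PROT_WRITE,
                           MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

          /* Turn the first page into a guard region - no extra VMA needed. */
          madvise(buf, page_size, MADV_GUARD_INSTALL);

          buf[page_size] = 'x';	/* Ordinary mapped memory - fine. */
          /* buf[0] = 'x';	   Would raise SIGSEGV. */

          /* Make the page usable again; other pages are left untouched. */
          madvise(buf, page_size, MADV_GUARD_REMOVE);
          buf[0] = 'x';		/* Now faults in a fresh page as normal. */

          munmap(buf, 16 * page_size);
          return 0;
  }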

Guard markers can be installed across multiple VMAs and any existing
mappings will be cleared, that is zapped, before installing the guard page
markers in the page tables.

There is no concept of 'nested' guard markers; multiple attempts to install
guard markers in a range will, after the first attempt, have no effect.

Importantly, removing guard markers over a range that contains both guard
markers and ordinary backed memory has no effect on anything but the guard
markers (including leaving huge pages un-split), so a user can safely
remove guard markers over a range of memory leaving the rest intact.

The actual mechanism by which the page table entries are specified makes
use of existing logic - PTE markers, which are used for the userfaultfd
UFFDIO_POISON mechanism.

Unfortunately PTE_MARKER_POISONED is not suited for the guard page
mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
logic to handle it.

We also extend the generic page walk mechanism to allow for installation of
PTEs (carefully restricted to memory management logic only to prevent
unwanted abuse).

We ensure that zapping performed by MADV_DONTNEED and MADV_FREE does not
remove guard markers, nor does forking (except when VM_WIPEONFORK is
specified for a VMA, which implies a total removal of memory
characteristics).

It's important to note that the guard page implementation is emphatically
NOT a security feature, so a user can remove the markers if they wish. We
simply implement it in such a way as to provide the least surprising
behaviour.

An extensive set of self-tests is provided which ensures behaviour is as
expected and additionally self-documents the expected behaviour of guard
ranges.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>

v3
* Cleaned up mm/pagewalk.c logic a bit to make things clearer, as suggested
  by Vlastimil.
* Explicitly avoid splitting THP on PTE installation, as suggested by
  Vlastimil. Note this has no impact on the guard pages logic, which has
  page table entry handlers at PUD, PMD and PTE level.
* Added WARN_ON_ONCE() to mm/hugetlb.c path where we don't expect a guard
  marker, as suggested by Vlastimil.
* Reverted change to is_poisoned_swp_entry() to exclude guard pages which
  has the effect of MADV_FREE _not_ clearing guard pages. After discussion
  with Vlastimil, it became apparent that the ability to 'cancel' the
  freeing operation by writing to the mapping after having issued an
  MADV_FREE would mean that we would risk unexpected behaviour should the
  guard pages be removed, so we now do not remove markers here at all.
* Added comment to PTE_MARKER_GUARD to highlight that memory tagged with
  the marker behaves as if it were a region mapped PROT_NONE, as
  highlighted by David.
* Rename poison -> install, unpoison -> remove (i.e. MADV_GUARD_INSTALL /
  MADV_GUARD_REMOVE over MADV_GUARD_POISON / MADV_GUARD_UNPOISON) at the
  request of David and John, who both find the poison analogy
  confusing/overloaded.
* After a lot of discussion, replace the looping behaviour should page
  faults race with guard page installation with a modest reattempt followed
  by returning -ERESTARTNOINTR to have the operation abort and re-enter,
  relieving lock contention and avoiding the possibility of allowing a
  malicious sandboxed process to impact the mmap lock or stall the overall
  process more than necessary, as suggested by Jann and Vlastimil having
  raised the issue.
* Adjusted the page table walker so a populated huge PUD or PMD is
  correctly treated as being populated, necessitating a zap. In v2 we
  incorrectly skipped over these, which would cause the logic to wrongly
  proceed as if nothing were populated and the install succeeded.
  Instead, explicitly check whether there is a huge page - if so, do not
  split but rather abort the operation and let zap take care of things.
* Updated the guard remove logic to not unnecessarily split huge pages
  either.
* Added a debug check to assert that the number of installed PTEs matches
  expectation, accounting for any existing guard pages.
* Adapted vector_madvise() used by the process_madvise() system call to
  handle -ERESTARTNOINTR correctly.

v2
* The macros in kselftest_harness.h seem to be broken - __EXPECT() is
  terminated by '} while (0); OPTIONAL_HANDLER(_assert)' meaning it is not
  safe in single-line if / else or for / while blocks, however working
  around this results in checkpatch producing invalid warnings, as reported
  by Shuah.
* Fixing these macros is out of scope for this series, so compromise and
  instead rewrite test blocks so as to use multiple lines by separating out
  a decl in most cases. This has the side effect of, for the most part,
  making things more readable.
* Heavily document the use of the volatile keyword - we can't avoid
  checkpatch complaining about this, so we explain it, as reported by
  Shuah.
* Updated commit message to highlight that we skip tests we lack
  permissions for, as reported by Shuah.
* Replaced a perror() with ksft_exit_fail_perror(), as reported by Shuah.
* Added user friendly messages to cases where tests are skipped due to lack
  of permissions, as reported by Shuah.
* Update the tool header to include the new MADV_GUARD_POISON/UNPOISON
  defines and directly include asm-generic/mman.h to get the
  platform-neutral versions to ensure we import them.
* Finally fixed Vlastimil's email address in Suggested-by tags from suze to
  suse, as reported by Vlastimil.
* Added linux-api to cc list, as reported by Vlastimil.
https://lore.kernel.org/all/cover.1729440856.git.lorenzo.stoakes@oracle.com/

v1
* Un-RFC'd as there appear to be no major objections to the approach, but
  rather debate on implementation.
* Fixed issue with arches which need mmu_context.h and tlbflush.h header
  imports in pagewalker logic to be able to use update_mmu_cache(), as
  reported by the kernel test bot.
* Added comments in page walker logic to clarify who can use
  ops->install_pte and why as well as adding a check_ops_valid() helper
  function, as suggested by Christoph.
* Pass false in full parameter in pte_clear_not_present_full() as suggested
  by Jann.
* Stopped erroneously requiring a write lock for the poison operation as
  suggested by Jann and Suren.
* Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
  consistent with how this is used elsewhere in the kernel as suggested by
  Jann.
* Avoid returning -EAGAIN if we are raced on page faults, just keep looping
  and duck out if a fatal signal is pending or a conditional reschedule is
  needed, as suggested by Jann.
* Avoid needlessly splitting huge PUDs and PMDs by specifying
  ACTION_CONTINUE, as suggested by Jann.
https://lore.kernel.org/all/cover.1729196871.git.lorenzo.stoakes@oracle.com/

RFC
https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (5):
  mm: pagewalk: add the ability to install PTEs
  mm: add PTE_MARKER_GUARD PTE marker
  mm: madvise: implement lightweight guard page mechanism
  tools: testing: update tools UAPI header for mman-common.h
  selftests/mm: add self tests for guard page feature

 arch/alpha/include/uapi/asm/mman.h           |    3 +
 arch/mips/include/uapi/asm/mman.h            |    3 +
 arch/parisc/include/uapi/asm/mman.h          |    3 +
 arch/xtensa/include/uapi/asm/mman.h          |    3 +
 include/linux/mm_inline.h                    |    2 +-
 include/linux/pagewalk.h                     |   18 +-
 include/linux/swapops.h                      |   24 +-
 include/uapi/asm-generic/mman-common.h       |    3 +
 mm/hugetlb.c                                 |    4 +
 mm/internal.h                                |   12 +
 mm/madvise.c                                 |  225 ++++
 mm/memory.c                                  |   18 +-
 mm/mprotect.c                                |    6 +-
 mm/mseal.c                                   |    1 +
 mm/pagewalk.c                                |  227 +++-
 tools/include/uapi/asm-generic/mman-common.h |    3 +
 tools/testing/selftests/mm/.gitignore        |    1 +
 tools/testing/selftests/mm/Makefile          |    1 +
 tools/testing/selftests/mm/guard-pages.c     | 1239 ++++++++++++++++++
 19 files changed, 1720 insertions(+), 76 deletions(-)
 create mode 100644 tools/testing/selftests/mm/guard-pages.c

--
2.47.0


* [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs
  2024-10-23 16:24 [PATCH v3 0/5] implement lightweight guard pages Lorenzo Stoakes
@ 2024-10-23 16:24 ` Lorenzo Stoakes
  2024-10-23 23:04   ` Andrew Morton
                     ` (2 more replies)
  2024-10-23 16:24 ` [PATCH v3 2/5] mm: add PTE_MARKER_GUARD PTE marker Lorenzo Stoakes
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 16:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

The existing generic pagewalk logic permits the walking of page tables,
invoking callbacks at individual page table levels via user-provided
mm_walk_ops callbacks.

This is useful for traversing existing page table entries, but precludes
the ability to establish new ones.

Existing mechanisms for performing a walk which also install page table
entries if necessary are heavily duplicated throughout the kernel, each
with semantic differences from one another and largely unavailable for use
elsewhere.

Rather than add yet another implementation, we extend the generic pagewalk
logic to enable the installation of page table entries by adding a new
install_pte() callback in mm_walk_ops. If this is specified, then upon
encountering a missing page table entry, we allocate and install a new one
and continue the traversal.

If a THP huge page is encountered at either the PMD or PUD level we split
it only if ops->pte_entry() (or ops->pmd_entry() at PUD level) is
specified; otherwise, if only ops->install_pte() is provided, we avoid the
unnecessary split.

We do not support hugetlb at this stage.

If this function returns an error, or an allocation fails during the
operation, we abort the operation altogether. It is up to the caller to
deal appropriately with partially populated page table ranges.

If install_pte() is defined, the semantics of pte_entry() change - this
callback is then only invoked if the entry already exists. This is a useful
property, as it allows a caller to handle existing PTEs while installing
new ones where necessary in the specified range.

If install_pte() is not defined, then there is no functional difference to
this patch, so all existing logic will work precisely as it did before.

As we only permit the installation of PTEs where a mapping does not already
exist, there is no need for TLB management; however, we do invoke
update_mmu_cache() for architectures which require manual maintenance of
mappings for other CPUs.

We explicitly do not allow the existing page walk API to expose this
feature as it is dangerous and intended for internal mm use only. Therefore
we provide a new walk_page_range_mm() function exposed only to
mm/internal.h.
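
As a rough illustration of the shape of the new interface, a hypothetical
mm-internal user might look like the below (the example_* names are
invented purely for illustration; only install_pte(), mm_walk_ops and
walk_page_range_mm() are introduced by this patch, and PTE_MARKER_POISONED
is simply an existing marker used as a stand-in entry):

  /* Invoked only for non-present PTEs - provide the entry to install.
   * The walker itself performs set_pte_at() and update_mmu_cache(). */
  static int example_install_pte(unsigned long addr, unsigned long next,
                                 pte_t *ptep, struct mm_walk *walk)
  {
          *ptep = make_pte_marker(PTE_MARKER_POISONED);
          return 0;
  }

  static int example_pte_entry(pte_t *pte, unsigned long addr,
                               unsigned long next, struct mm_walk *walk)
  {
          /* With install_pte set, this only sees already-present entries. */
          return 0;
  }

  static const struct mm_walk_ops example_ops = {
          .pte_entry      = example_pte_entry,
          .install_pte    = example_install_pte,
          .walk_lock      = PGWALK_RDLOCK,
  };

  static int example_install(struct mm_struct *mm, unsigned long start,
                             unsigned long end)
  {
          /* Only callable from mm-internal code via mm/internal.h. */
          return walk_page_range_mm(mm, start, end, &example_ops, NULL);
  }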

Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/pagewalk.h |  18 +++-
 mm/internal.h            |   6 ++
 mm/pagewalk.c            | 227 +++++++++++++++++++++++++++------------
 3 files changed, 182 insertions(+), 69 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index f5eb5a32aeed..9700a29f8afb 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -25,12 +25,15 @@ enum page_walk_lock {
  *			this handler is required to be able to handle
  *			pmd_trans_huge() pmds.  They may simply choose to
  *			split_huge_page() instead of handling it explicitly.
- * @pte_entry:		if set, called for each PTE (lowest-level) entry,
- *			including empty ones
+ * @pte_entry:		if set, called for each PTE (lowest-level) entry
+ *			including empty ones, except if @install_pte is set.
+ *			If @install_pte is set, @pte_entry is called only for
+ *			existing PTEs.
  * @pte_hole:		if set, called for each hole at all levels,
  *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
  *			Any folded depths (where PTRS_PER_P?D is equal to 1)
- *			are skipped.
+ *			are skipped. If @install_pte is specified, this will
+ *			not trigger for any populated ranges.
  * @hugetlb_entry:	if set, called for each hugetlb entry. This hook
  *			function is called with the vma lock held, in order to
  *			protect against a concurrent freeing of the pte_t* or
@@ -51,6 +54,13 @@ enum page_walk_lock {
  * @pre_vma:            if set, called before starting walk on a non-null vma.
  * @post_vma:           if set, called after a walk on a non-null vma, provided
  *                      that @pre_vma and the vma walk succeeded.
+ * @install_pte:        if set, missing page table entries are installed and
+ *                      thus all levels are always walked in the specified
+ *                      range. This callback is then invoked at the PTE level
+ *                      (having split any THP pages prior), providing the PTE to
+ *                      install. If allocations fail, the walk is aborted. This
+ *                      operation is only available for userland memory. Not
+ *                      usable for hugetlb ranges.
  *
  * p?d_entry callbacks are called even if those levels are folded on a
  * particular architecture/configuration.
@@ -76,6 +86,8 @@ struct mm_walk_ops {
 	int (*pre_vma)(unsigned long start, unsigned long end,
 		       struct mm_walk *walk);
 	void (*post_vma)(struct mm_walk *walk);
+	int (*install_pte)(unsigned long addr, unsigned long next,
+			   pte_t *ptep, struct mm_walk *walk);
 	enum page_walk_lock walk_lock;
 };
 
diff --git a/mm/internal.h b/mm/internal.h
index 508f7802dd2b..fb1fb0c984e4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -12,6 +12,7 @@
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
 #include <linux/pagemap.h>
+#include <linux/pagewalk.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
@@ -1451,4 +1452,9 @@ static inline void accept_page(struct page *page)
 }
 #endif /* CONFIG_UNACCEPTED_MEMORY */
 
+/* pagewalk.c */
+int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
+		unsigned long end, const struct mm_walk_ops *ops,
+		void *private);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 5f9f01532e67..f3cbad384344 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -3,9 +3,14 @@
 #include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
+#include <linux/mmu_context.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
 
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
 /*
  * We want to know the real level where a entry is located ignoring any
  * folding of levels which may be happening. For example if p4d is folded then
@@ -29,9 +34,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
 	int err = 0;
 
 	for (;;) {
-		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
-		if (err)
-		       break;
+		if (ops->install_pte && pte_none(ptep_get(pte))) {
+			pte_t new_pte;
+
+			err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte,
+					       walk);
+			if (err)
+				break;
+
+			set_pte_at(walk->mm, addr, pte, new_pte);
+			/* Non-present before, so for arches that need it. */
+			if (!WARN_ON_ONCE(walk->no_vma))
+				update_mmu_cache(walk->vma, addr, pte);
+		} else {
+			err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
+			if (err)
+				break;
+		}
 		if (addr >= end - PAGE_SIZE)
 			break;
 		addr += PAGE_SIZE;
@@ -89,11 +108,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 again:
 		next = pmd_addr_end(addr, end);
 		if (pmd_none(*pmd)) {
-			if (ops->pte_hole)
+			if (ops->install_pte)
+				err = __pte_alloc(walk->mm, pmd);
+			else if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
-			continue;
+			if (!ops->install_pte)
+				continue;
 		}
 
 		walk->action = ACTION_SUBTREE;
@@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 
 		if (walk->action == ACTION_AGAIN)
 			goto again;
-
-		/*
-		 * Check this here so we only break down trans_huge
-		 * pages when we _need_ to
-		 */
-		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
-		    walk->action == ACTION_CONTINUE ||
-		    !(ops->pte_entry))
+		if (walk->action == ACTION_CONTINUE)
 			continue;
+		if (!ops->install_pte && !ops->pte_entry)
+			continue; /* Nothing to do. */
+		if (!ops->pte_entry && ops->install_pte &&
+		    pmd_present(*pmd) &&
+		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
+			continue; /* Avoid unnecessary split. */
 
 		if (walk->vma)
 			split_huge_pmd(walk->vma, pmd, addr);
+		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
+			continue; /* Nothing to do. */
 
 		err = walk_pte_range(pmd, addr, next, walk);
 		if (err)
@@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
  again:
 		next = pud_addr_end(addr, end);
 		if (pud_none(*pud)) {
-			if (ops->pte_hole)
+			if (ops->install_pte)
+				err = __pmd_alloc(walk->mm, pud, addr);
+			else if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
-			continue;
+			if (!ops->install_pte)
+				continue;
 		}
 
 		walk->action = ACTION_SUBTREE;
@@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 
 		if (walk->action == ACTION_AGAIN)
 			goto again;
-
-		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
-		    walk->action == ACTION_CONTINUE ||
-		    !(ops->pmd_entry || ops->pte_entry))
+		if (walk->action == ACTION_CONTINUE)
 			continue;
+		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
+			continue;  /* Nothing to do. */
+		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
+		    pud_present(*pud) &&
+		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
+			continue; /* Avoid unnecessary split. */
 
 		if (walk->vma)
 			split_huge_pud(walk->vma, pud, addr);
+		else if (pud_leaf(*pud) || !pud_present(*pud))
+			continue; /* Nothing to do. */
+
 		if (pud_none(*pud))
 			goto again;
 
@@ -196,18 +228,22 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	do {
 		next = p4d_addr_end(addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
-			if (ops->pte_hole)
+			if (ops->install_pte)
+				err = __pud_alloc(walk->mm, p4d, addr);
+			else if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
-			continue;
+			if (!ops->install_pte)
+				continue;
 		}
 		if (ops->p4d_entry) {
 			err = ops->p4d_entry(p4d, addr, next, walk);
 			if (err)
 				break;
 		}
-		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
+		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry ||
+		    ops->install_pte)
 			err = walk_pud_range(p4d, addr, next, walk);
 		if (err)
 			break;
@@ -231,18 +267,22 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
-			if (ops->pte_hole)
+			if (ops->install_pte)
+				err = __p4d_alloc(walk->mm, pgd, addr);
+			else if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, 0, walk);
 			if (err)
 				break;
-			continue;
+			if (!ops->install_pte)
+				continue;
 		}
 		if (ops->pgd_entry) {
 			err = ops->pgd_entry(pgd, addr, next, walk);
 			if (err)
 				break;
 		}
-		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry)
+		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry ||
+		    ops->pte_entry || ops->install_pte)
 			err = walk_p4d_range(pgd, addr, next, walk);
 		if (err)
 			break;
@@ -334,6 +374,11 @@ static int __walk_page_range(unsigned long start, unsigned long end,
 	int err = 0;
 	struct vm_area_struct *vma = walk->vma;
 	const struct mm_walk_ops *ops = walk->ops;
+	bool is_hugetlb = is_vm_hugetlb_page(vma);
+
+	/* We do not support hugetlb PTE installation. */
+	if (ops->install_pte && is_hugetlb)
+		return -EINVAL;
 
 	if (ops->pre_vma) {
 		err = ops->pre_vma(start, end, walk);
@@ -341,7 +386,7 @@ static int __walk_page_range(unsigned long start, unsigned long end,
 			return err;
 	}
 
-	if (is_vm_hugetlb_page(vma)) {
+	if (is_hugetlb) {
 		if (ops->hugetlb_entry)
 			err = walk_hugetlb_range(start, end, walk);
 	} else
@@ -380,47 +425,14 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
 #endif
 }
 
-/**
- * walk_page_range - walk page table with caller specific callbacks
- * @mm:		mm_struct representing the target process of page table walk
- * @start:	start address of the virtual address range
- * @end:	end address of the virtual address range
- * @ops:	operation to call during the walk
- * @private:	private data for callbacks' usage
- *
- * Recursively walk the page table tree of the process represented by @mm
- * within the virtual address range [@start, @end). During walking, we can do
- * some caller-specific works for each entry, by setting up pmd_entry(),
- * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
- * callbacks, the associated entries/pages are just ignored.
- * The return values of these callbacks are commonly defined like below:
- *
- *  - 0  : succeeded to handle the current entry, and if you don't reach the
- *         end address yet, continue to walk.
- *  - >0 : succeeded to handle the current entry, and return to the caller
- *         with caller specific value.
- *  - <0 : failed to handle the current entry, and return to the caller
- *         with error code.
- *
- * Before starting to walk page table, some callers want to check whether
- * they really want to walk over the current vma, typically by checking
- * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
- * purpose.
- *
- * If operations need to be staged before and committed after a vma is walked,
- * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
- * since it is intended to handle commit-type operations, can't return any
- * errors.
- *
- * struct mm_walk keeps current values of some common data like vma and pmd,
- * which are useful for the access from callbacks. If you want to pass some
- * caller-specific data to callbacks, @private should be helpful.
+/*
+ * See the comment for walk_page_range(), this performs the heavy lifting of the
+ * operation, only sets no restrictions on how the walk proceeds.
  *
- * Locking:
- *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
- *   because these function traverse vma list and/or access to vma's data.
+ * We usually restrict the ability to install PTEs, but this functionality is
+ * available to internal memory management code and provided in mm/internal.h.
  */
-int walk_page_range(struct mm_struct *mm, unsigned long start,
+int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private)
 {
@@ -479,6 +491,80 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
 	return err;
 }
 
+/*
+ * Determine if the walk operations specified are permitted to be used for a
+ * page table walk.
+ *
+ * This check is performed on all functions which are parameterised by walk
+ * operations and exposed in include/linux/pagewalk.h.
+ *
+ * Internal memory management code can use the walk_page_range_mm() function to
+ * be able to use all page walking operations.
+ */
+static bool check_ops_valid(const struct mm_walk_ops *ops)
+{
+	/*
+	 * The installation of PTEs is solely under the control of memory
+	 * management logic and subject to many subtle locking, security and
+	 * cache considerations so we cannot permit other users to do so, and
+	 * certainly not for exported symbols.
+	 */
+	if (ops->install_pte)
+		return false;
+
+	return true;
+}
+
+/**
+ * walk_page_range - walk page table with caller specific callbacks
+ * @mm:		mm_struct representing the target process of page table walk
+ * @start:	start address of the virtual address range
+ * @end:	end address of the virtual address range
+ * @ops:	operation to call during the walk
+ * @private:	private data for callbacks' usage
+ *
+ * Recursively walk the page table tree of the process represented by @mm
+ * within the virtual address range [@start, @end). During walking, we can do
+ * some caller-specific works for each entry, by setting up pmd_entry(),
+ * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
+ * callbacks, the associated entries/pages are just ignored.
+ * The return values of these callbacks are commonly defined like below:
+ *
+ *  - 0  : succeeded to handle the current entry, and if you don't reach the
+ *         end address yet, continue to walk.
+ *  - >0 : succeeded to handle the current entry, and return to the caller
+ *         with caller specific value.
+ *  - <0 : failed to handle the current entry, and return to the caller
+ *         with error code.
+ *
+ * Before starting to walk page table, some callers want to check whether
+ * they really want to walk over the current vma, typically by checking
+ * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
+ * purpose.
+ *
+ * If operations need to be staged before and committed after a vma is walked,
+ * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
+ * since it is intended to handle commit-type operations, can't return any
+ * errors.
+ *
+ * struct mm_walk keeps current values of some common data like vma and pmd,
+ * which are useful for the access from callbacks. If you want to pass some
+ * caller-specific data to callbacks, @private should be helpful.
+ *
+ * Locking:
+ *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
+ *   because these function traverse vma list and/or access to vma's data.
+ */
+int walk_page_range(struct mm_struct *mm, unsigned long start,
+		unsigned long end, const struct mm_walk_ops *ops,
+		void *private)
+{
+	if (!check_ops_valid(ops))
+		return -EINVAL;
+
+	return walk_page_range_mm(mm, start, end, ops, private);
+}
+
 /**
  * walk_page_range_novma - walk a range of pagetables not backed by a vma
  * @mm:		mm_struct representing the target process of page table walk
@@ -494,7 +580,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
  * walking the kernel pages tables or page tables for firmware.
  *
  * Note: Be careful to walk the kernel pages tables, the caller may be need to
- * take other effective approache (mmap lock may be insufficient) to prevent
+ * take other effective approaches (mmap lock may be insufficient) to prevent
  * the intermediate kernel page tables belonging to the specified address range
  * from being freed (e.g. memory hot-remove).
  */
@@ -513,6 +599,8 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 
 	if (start >= end || !walk.mm)
 		return -EINVAL;
+	if (!check_ops_valid(ops))
+		return -EINVAL;
 
 	/*
 	 * 1) For walking the user virtual address space:
@@ -556,6 +644,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
 		return -EINVAL;
 	if (start < vma->vm_start || end > vma->vm_end)
 		return -EINVAL;
+	if (!check_ops_valid(ops))
+		return -EINVAL;
 
 	process_mm_walk_lock(walk.mm, ops->walk_lock);
 	process_vma_walk_lock(vma, ops->walk_lock);
@@ -574,6 +664,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
 
 	if (!walk.mm)
 		return -EINVAL;
+	if (!check_ops_valid(ops))
+		return -EINVAL;
 
 	process_mm_walk_lock(walk.mm, ops->walk_lock);
 	process_vma_walk_lock(vma, ops->walk_lock);
@@ -623,6 +715,9 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
 	unsigned long start_addr, end_addr;
 	int err = 0;
 
+	if (!check_ops_valid(ops))
+		return -EINVAL;
+
 	lockdep_assert_held(&mapping->i_mmap_rwsem);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index,
 				  first_index + nr - 1) {
-- 
2.47.0



* [PATCH v3 2/5] mm: add PTE_MARKER_GUARD PTE marker
  2024-10-23 16:24 [PATCH v3 0/5] implement lightweight guard pages Lorenzo Stoakes
  2024-10-23 16:24 ` [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs Lorenzo Stoakes
@ 2024-10-23 16:24 ` Lorenzo Stoakes
  2024-10-28 20:34   ` Jarkko Sakkinen
  2024-10-23 16:24 ` [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism Lorenzo Stoakes
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 16:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

Add a new PTE marker that results in any access causing the accessing
process to segfault.

This is preferable to PTE_MARKER_POISONED, which results in the same
handling as hardware poisoned memory, and is thus undesirable for cases
where we simply wish to 'soft' poison a range.

This is in preparation for implementing the ability to specify guard pages
at the page table level, i.e. ranges that, when accessed, should cause
process termination.

Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the
function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single
purpose was simply incorrect.

We then reuse the same logic to determine whether a zap should clear a
guard entry - this should only be performed on teardown and never on
MADV_DONTNEED or MADV_FREE.

We additionally add a WARN_ON_ONCE() in hugetlb logic should a guard marker
be encountered there, as we explicitly do not support this operation and
this should not occur.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mm_inline.h |  2 +-
 include/linux/swapops.h   | 24 +++++++++++++++++++++++-
 mm/hugetlb.c              |  4 ++++
 mm/memory.c               | 18 +++++++++++++++---
 mm/mprotect.c             |  6 ++++--
 5 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 355cf46a01a6..1b6a917fffa4 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -544,7 +544,7 @@ static inline pte_marker copy_pte_marker(
 {
 	pte_marker srcm = pte_marker_get(entry);
 	/* Always copy error entries. */
-	pte_marker dstm = srcm & PTE_MARKER_POISONED;
+	pte_marker dstm = srcm & (PTE_MARKER_POISONED | PTE_MARKER_GUARD);
 
 	/* Only copy PTE markers if UFFD register matches. */
 	if ((srcm & PTE_MARKER_UFFD_WP) && userfaultfd_wp(dst_vma))
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index cb468e418ea1..96f26e29fefe 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -426,9 +426,19 @@ typedef unsigned long pte_marker;
  * "Poisoned" here is meant in the very general sense of "future accesses are
  * invalid", instead of referring very specifically to hardware memory errors.
  * This marker is meant to represent any of various different causes of this.
+ *
+ * Note that, when encountered by the faulting logic, PTEs with this marker will
+ * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error
+ * logic.
  */
 #define  PTE_MARKER_POISONED			BIT(1)
-#define  PTE_MARKER_MASK			(BIT(2) - 1)
+/*
+ * Indicates that, on fault, this PTE will case a SIGSEGV signal to be
+ * sent. This means guard markers behave in effect as if the region were mapped
+ * PROT_NONE, rather than if they were a memory hole or equivalent.
+ */
+#define  PTE_MARKER_GUARD			BIT(2)
+#define  PTE_MARKER_MASK			(BIT(3) - 1)
 
 static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
 {
@@ -464,6 +474,18 @@ static inline int is_poisoned_swp_entry(swp_entry_t entry)
 {
 	return is_pte_marker_entry(entry) &&
 	    (pte_marker_get(entry) & PTE_MARKER_POISONED);
+
+}
+
+static inline swp_entry_t make_guard_swp_entry(void)
+{
+	return make_pte_marker_entry(PTE_MARKER_GUARD);
+}
+
+static inline int is_guard_swp_entry(swp_entry_t entry)
+{
+	return is_pte_marker_entry(entry) &&
+		(pte_marker_get(entry) & PTE_MARKER_GUARD);
 }
 
 /*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 906294ac85dc..2c8c5da0f5d3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6353,6 +6353,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 				ret = VM_FAULT_HWPOISON_LARGE |
 				      VM_FAULT_SET_HINDEX(hstate_index(h));
 				goto out_mutex;
+			} else if (WARN_ON_ONCE(marker & PTE_MARKER_GUARD)) {
+				/* This isn't supported in hugetlb. */
+				ret = VM_FAULT_SIGSEGV;
+				goto out_mutex;
 			}
 		}
 
diff --git a/mm/memory.c b/mm/memory.c
index 0f614523b9f4..551455cd453f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1455,7 +1455,7 @@ static inline bool should_zap_folio(struct zap_details *details,
 	return !folio_test_anon(folio);
 }
 
-static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
+static inline bool zap_drop_markers(struct zap_details *details)
 {
 	if (!details)
 		return false;
@@ -1476,7 +1476,7 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
 	if (vma_is_anonymous(vma))
 		return;
 
-	if (zap_drop_file_uffd_wp(details))
+	if (zap_drop_markers(details))
 		return;
 
 	for (;;) {
@@ -1671,7 +1671,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			 * drop the marker if explicitly requested.
 			 */
 			if (!vma_is_anonymous(vma) &&
-			    !zap_drop_file_uffd_wp(details))
+			    !zap_drop_markers(details))
+				continue;
+		} else if (is_guard_swp_entry(entry)) {
+			/*
+			 * Ordinary zapping should not remove guard PTE
+			 * markers. Only do so if we should remove PTE markers
+			 * in general.
+			 */
+			if (!zap_drop_markers(details))
 				continue;
 		} else if (is_hwpoison_entry(entry) ||
 			   is_poisoned_swp_entry(entry)) {
@@ -4003,6 +4011,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	if (marker & PTE_MARKER_POISONED)
 		return VM_FAULT_HWPOISON;
 
+	/* Hitting a guard page is always a fatal condition. */
+	if (marker & PTE_MARKER_GUARD)
+		return VM_FAULT_SIGSEGV;
+
 	if (pte_marker_entry_uffd_wp(entry))
 		return pte_marker_handle_uffd_wp(vmf);
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0c5d6d06107d..1f671b0667bd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -236,9 +236,11 @@ static long change_pte_range(struct mmu_gather *tlb,
 			} else if (is_pte_marker_entry(entry)) {
 				/*
 				 * Ignore error swap entries unconditionally,
-				 * because any access should sigbus anyway.
+				 * because any access should sigbus/sigsegv
+				 * anyway.
 				 */
-				if (is_poisoned_swp_entry(entry))
+				if (is_poisoned_swp_entry(entry) ||
+				    is_guard_swp_entry(entry))
 					continue;
 				/*
 				 * If this is uffd-wp pte marker and we'd like
-- 
2.47.0



* [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism
  2024-10-23 16:24 [PATCH v3 0/5] implement lightweight guard pages Lorenzo Stoakes
  2024-10-23 16:24 ` [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs Lorenzo Stoakes
  2024-10-23 16:24 ` [PATCH v3 2/5] mm: add PTE_MARKER_GUARD PTE marker Lorenzo Stoakes
@ 2024-10-23 16:24 ` Lorenzo Stoakes
  2024-10-23 23:12   ` Andrew Morton
                     ` (3 more replies)
  2024-10-23 16:24 ` [PATCH v3 4/5] tools: testing: update tools UAPI header for mman-common.h Lorenzo Stoakes
  2024-10-23 16:24 ` [PATCH v3 5/5] selftests/mm: add self tests for guard page feature Lorenzo Stoakes
  4 siblings, 4 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 16:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

Implement a new lightweight guard page feature, that is regions of userland
virtual memory that, when accessed, cause a fatal signal to arise.

Currently users must establish PROT_NONE ranges to achieve this.

However this is very costly memory-wise - we need a VMA for each and every
one of these regions AND they become unmergeable with surrounding VMAs.

In addition, repeated mmap() calls require repeated kernel context switches
and contention on the mmap lock to install these ranges, potentially also
having to unmap memory if installed over existing ranges.

The lightweight guard approach eliminates the VMA cost altogether - rather
than establishing a PROT_NONE VMA, it operates at the level of page table
entries - establishing PTE markers such that accesses to them cause a fault
followed by a SIGSEGV signal being raised.

This is achieved through the PTE marker mechanism, which we have already
extended to provide PTE_MARKER_GUARD, and which we install via the generic
page walking logic, also extended for this purpose.

These guard ranges are established with MADV_GUARD_INSTALL. If the range in
which they are installed contains any existing mappings, these will be
zapped, i.e. the range freed and the memory unmapped (thus mimicking the
behaviour of MADV_DONTNEED in this respect).

Any existing guard entries will be left untouched. There is therefore no
nesting of guarded pages.

Guarded ranges are NOT cleared by MADV_DONTNEED nor MADV_FREE (in both
instances the memory range may be reused at which point a user would expect
guards to still be in place), but they are cleared via MADV_GUARD_REMOVE,
process teardown or unmapping of memory ranges.

The guard property can be removed from ranges via MADV_GUARD_REMOVE. The
ranges over which this is applied, should they contain non-guard entries,
will be untouched, with only guard entries being cleared.

We permit this operation on anonymous memory only, and only VMAs which are
non-special, non-huge and not mlock()'d (if we permitted this we'd have to
drop locked pages which would be rather counterintuitive).

Racing page faults can cause attempts to install guard pages to be
interrupted, resulting in a zap, and this process can end up being
repeated. If this happens more often than would be expected in normal
operation, we rescind locks, return -ERESTARTNOINTR and have the operation
retried from userspace, which avoids lock contention in this scenario.
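
As an illustration of the intended use case, below is a minimal sketch of a
threading-library style caller placing a guard region below a manually
allocated thread stack, replacing what would previously have been a
separate PROT_NONE mapping (sizes and names are invented for the example,
MADV_GUARD_INSTALL is defined locally in case installed headers lack it,
and error handling is elided):

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sys/mman.h>

  #ifndef MADV_GUARD_INSTALL
  #define MADV_GUARD_INSTALL 102
  #endif

  #define GUARD_SIZE	(4UL << 10)	/* one 4 KiB page, for the example */
  #define STACK_SIZE	(512UL << 10)

  static void *worker(void *arg) { return NULL; }

  int main(void)
  {
          pthread_attr_t attr;
          pthread_t thread;
          /* One VMA covering both the guard region and the usable stack. */
          char *stack = mmap(NULL, GUARD_SIZE + STACK_SIZE,
                             PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          /* Overruns into the lowest page now fault fatally. */
          madvise(stack, GUARD_SIZE, MADV_GUARD_INSTALL);

          pthread_attr_init(&attr);
          pthread_attr_setstack(&attr, stack + GUARD_SIZE, STACK_SIZE);
          pthread_create(&thread, &attr, worker, NULL);
          pthread_join(thread, NULL);
          return 0;
  }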

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 arch/alpha/include/uapi/asm/mman.h     |   3 +
 arch/mips/include/uapi/asm/mman.h      |   3 +
 arch/parisc/include/uapi/asm/mman.h    |   3 +
 arch/xtensa/include/uapi/asm/mman.h    |   3 +
 include/uapi/asm-generic/mman-common.h |   3 +
 mm/internal.h                          |   6 +
 mm/madvise.c                           | 225 +++++++++++++++++++++++++
 mm/mseal.c                             |   1 +
 8 files changed, 247 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..1e700468a685 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -78,6 +78,9 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
+#define MADV_GUARD_REMOVE 103		/* unguard range */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 9c48d9a21aa0..b700dae28c48 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -105,6 +105,9 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
+#define MADV_GUARD_REMOVE 103		/* unguard range */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..b6a709506987 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -75,6 +75,9 @@
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
+#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
+#define MADV_GUARD_REMOVE 103		/* unguard range */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1ff0c858544f..99d4ccee7f6e 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -113,6 +113,9 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
+#define MADV_GUARD_REMOVE 103		/* unguard range */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..1ea2c4c33b86 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,9 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
+#define MADV_GUARD_REMOVE 103		/* unguard range */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/internal.h b/mm/internal.h
index fb1fb0c984e4..fcf08b5e64dc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -423,6 +423,12 @@ extern unsigned long highest_memmap_pfn;
  */
 #define MAX_RECLAIM_RETRIES 16
 
+/*
+ * Maximum number of attempts we make to install guard pages before we give up
+ * and return -ERESTARTNOINTR to have userspace try again.
+ */
+#define MAX_MADVISE_GUARD_RETRIES 3
+
 /*
  * in mm/vmscan.c:
  */
diff --git a/mm/madvise.c b/mm/madvise.c
index e871a72a6c32..48eba25e25fe 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -60,6 +60,8 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
 	case MADV_COLLAPSE:
+	case MADV_GUARD_INSTALL:
+	case MADV_GUARD_REMOVE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1017,6 +1019,214 @@ static long madvise_remove(struct vm_area_struct *vma,
 	return error;
 }
 
+static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
+{
+	vm_flags_t disallowed = VM_SPECIAL | VM_HUGETLB;
+
+	/*
+	 * A user could lock after setting a guard range but that's fine, as
+	 * they'd not be able to fault in. The issue arises when we try to zap
+	 * existing locked VMAs. We don't want to do that.
+	 */
+	if (!allow_locked)
+		disallowed |= VM_LOCKED;
+
+	if (!vma_is_anonymous(vma))
+		return false;
+
+	if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
+		return false;
+
+	return true;
+}
+
+static bool is_guard_pte_marker(pte_t ptent)
+{
+	return is_pte_marker(ptent) &&
+		is_guard_swp_entry(pte_to_swp_entry(ptent));
+}
+
+static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
+				   unsigned long next, struct mm_walk *walk)
+{
+	pud_t pudval = pudp_get(pud);
+
+	/* If huge return >0 so we abort the operation + zap. */
+	return pud_trans_huge(pudval) || pud_devmap(pudval);
+}
+
+static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr,
+				   unsigned long next, struct mm_walk *walk)
+{
+	pmd_t pmdval = pmdp_get(pmd);
+
+	/* If huge return >0 so we abort the operation + zap. */
+	return pmd_trans_huge(pmdval) || pmd_devmap(pmdval);
+}
+
+static int guard_install_pte_entry(pte_t *pte, unsigned long addr,
+				   unsigned long next, struct mm_walk *walk)
+{
+	pte_t pteval = ptep_get(pte);
+	unsigned long *nr_pages = (unsigned long *)walk->private;
+
+	/* If there is already a guard page marker, we have nothing to do. */
+	if (is_guard_pte_marker(pteval)) {
+		(*nr_pages)++;
+
+		return 0;
+	}
+
+	/* If populated return >0 so we abort the operation + zap. */
+	return 1;
+}
+
+static int guard_install_set_pte(unsigned long addr, unsigned long next,
+				 pte_t *ptep, struct mm_walk *walk)
+{
+	unsigned long *nr_pages = (unsigned long *)walk->private;
+
+	/* Simply install a PTE marker, this causes segfault on access. */
+	*ptep = make_pte_marker(PTE_MARKER_GUARD);
+	(*nr_pages)++;
+
+	return 0;
+}
+
+static const struct mm_walk_ops guard_install_walk_ops = {
+	.pud_entry		= guard_install_pud_entry,
+	.pmd_entry		= guard_install_pmd_entry,
+	.pte_entry		= guard_install_pte_entry,
+	.install_pte		= guard_install_set_pte,
+	.walk_lock		= PGWALK_RDLOCK,
+};
+
+static long madvise_guard_install(struct vm_area_struct *vma,
+				 struct vm_area_struct **prev,
+				 unsigned long start, unsigned long end)
+{
+	long err;
+	int i;
+
+	*prev = vma;
+	if (!is_valid_guard_vma(vma, /* allow_locked = */false))
+		return -EINVAL;
+
+	/*
+	 * If we install guard markers, then the range is no longer
+	 * empty from a page table perspective and therefore it's
+	 * appropriate to have an anon_vma.
+	 *
+	 * This ensures that on fork, we copy page tables correctly.
+	 */
+	err = anon_vma_prepare(vma);
+	if (err)
+		return err;
+
+	/*
+	 * Optimistically try to install the guard marker pages first. If any
+	 * non-guard pages are encountered, give up and zap the range before
+	 * trying again.
+	 *
+	 * We try a few times before giving up and releasing back to userland to
+	 * loop around, releasing locks in the process to avoid contention. This
+	 * would only happen if there was a great many racing page faults.
+	 *
+	 * In most cases we should simply install the guard markers immediately
+	 * with no zap or looping.
+	 */
+	for (i = 0; i < MAX_MADVISE_GUARD_RETRIES; i++) {
+		unsigned long nr_pages = 0;
+
+		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
+		err = walk_page_range_mm(vma->vm_mm, start, end,
+					 &guard_install_walk_ops, &nr_pages);
+		if (err < 0)
+			return err;
+
+		if (err == 0) {
+			unsigned long nr_expected_pages = PHYS_PFN(end - start);
+
+			VM_WARN_ON(nr_pages != nr_expected_pages);
+			return 0;
+		}
+
+		/*
+		 * OK some of the range have non-guard pages mapped, zap
+		 * them. This leaves existing guard pages in place.
+		 */
+		zap_page_range_single(vma, start, end - start, NULL);
+	}
+
+	/*
+	 * We were unable to install the guard pages due to being raced by page
+	 * faults. This should not happen ordinarily. We return to userspace and
+	 * immediately retry, relieving lock contention.
+	 */
+	return -ERESTARTNOINTR;
+}
+
+static int guard_remove_pud_entry(pud_t *pud, unsigned long addr,
+				  unsigned long next, struct mm_walk *walk)
+{
+	pud_t pudval = pudp_get(pud);
+
+	/* If huge, cannot have guard pages present, so no-op - skip. */
+	if (pud_trans_huge(pudval) || pud_devmap(pudval))
+		walk->action = ACTION_CONTINUE;
+
+	return 0;
+}
+
+static int guard_remove_pmd_entry(pmd_t *pmd, unsigned long addr,
+				  unsigned long next, struct mm_walk *walk)
+{
+	pmd_t pmdval = pmdp_get(pmd);
+
+	/* If huge, cannot have guard pages present, so no-op - skip. */
+	if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
+		walk->action = ACTION_CONTINUE;
+
+	return 0;
+}
+
+static int guard_remove_pte_entry(pte_t *pte, unsigned long addr,
+				  unsigned long next, struct mm_walk *walk)
+{
+	pte_t ptent = ptep_get(pte);
+
+	if (is_guard_pte_marker(ptent)) {
+		/* Simply clear the PTE marker. */
+		pte_clear_not_present_full(walk->mm, addr, pte, false);
+		update_mmu_cache(walk->vma, addr, pte);
+	}
+
+	return 0;
+}
+
+static const struct mm_walk_ops guard_remove_walk_ops = {
+	.pud_entry		= guard_remove_pud_entry,
+	.pmd_entry		= guard_remove_pmd_entry,
+	.pte_entry		= guard_remove_pte_entry,
+	.walk_lock		= PGWALK_RDLOCK,
+};
+
+static long madvise_guard_remove(struct vm_area_struct *vma,
+				 struct vm_area_struct **prev,
+				 unsigned long start, unsigned long end)
+{
+	*prev = vma;
+	/*
+	 * We're ok with removing guards in mlock()'d ranges, as this is a
+	 * non-destructive action.
+	 */
+	if (!is_valid_guard_vma(vma, /* allow_locked = */true))
+		return -EINVAL;
+
+	return walk_page_range(vma->vm_mm, start, end,
+			       &guard_remove_walk_ops, NULL);
+}
+
 /*
  * Apply an madvise behavior to a region of a vma.  madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1098,6 +1308,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		break;
 	case MADV_COLLAPSE:
 		return madvise_collapse(vma, prev, start, end);
+	case MADV_GUARD_INSTALL:
+		return madvise_guard_install(vma, prev, start, end);
+	case MADV_GUARD_REMOVE:
+		return madvise_guard_remove(vma, prev, start, end);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1197,6 +1411,8 @@ madvise_behavior_valid(int behavior)
 	case MADV_DODUMP:
 	case MADV_WIPEONFORK:
 	case MADV_KEEPONFORK:
+	case MADV_GUARD_INSTALL:
+	case MADV_GUARD_REMOVE:
 #ifdef CONFIG_MEMORY_FAILURE
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
@@ -1490,6 +1706,15 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 	while (iov_iter_count(iter)) {
 		ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
 				 iter_iov_len(iter), behavior);
+		/*
+		 * We cannot return this, as we instead return the number of
+		 * successful operations. Since all this would achieve in a
+		 * single madvise() invocation is to re-enter the syscall, and
+		 * we have already rescinded locks, it should be no problem to
+		 * simply try again.
+		 */
+		if (ret == -ERESTARTNOINTR)
+			continue;
 		if (ret < 0)
 			break;
 		iov_iter_advance(iter, iter_iov_len(iter));
diff --git a/mm/mseal.c b/mm/mseal.c
index ece977bd21e1..81d6e980e8a9 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -30,6 +30,7 @@ static bool is_madv_discard(int behavior)
 	case MADV_REMOVE:
 	case MADV_DONTFORK:
 	case MADV_WIPEONFORK:
+	case MADV_GUARD_INSTALL:
 		return true;
 	}
 
-- 
2.47.0



* [PATCH v3 4/5] tools: testing: update tools UAPI header for mman-common.h
  2024-10-23 16:24 [PATCH v3 0/5] implement lightweight guard pages Lorenzo Stoakes
                   ` (2 preceding siblings ...)
  2024-10-23 16:24 ` [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism Lorenzo Stoakes
@ 2024-10-23 16:24 ` Lorenzo Stoakes
  2024-10-23 16:24 ` [PATCH v3 5/5] selftests/mm: add self tests for guard page feature Lorenzo Stoakes
  4 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 16:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

Import the new MADV_GUARD_INSTALL/REMOVE madvise flags.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/include/uapi/asm-generic/mman-common.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..1ea2c4c33b86 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,9 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
+#define MADV_GUARD_REMOVE 103		/* unguard range */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
-- 
2.47.0



* [PATCH v3 5/5] selftests/mm: add self tests for guard page feature
  2024-10-23 16:24 [PATCH v3 0/5] implement lightweight guard pages Lorenzo Stoakes
                   ` (3 preceding siblings ...)
  2024-10-23 16:24 ` [PATCH v3 4/5] tools: testing: update tools UAPI header for mman-common.h Lorenzo Stoakes
@ 2024-10-23 16:24 ` Lorenzo Stoakes
  2024-10-28 20:32   ` Jarkko Sakkinen
  4 siblings, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 16:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

Utilise the kselftest harness to implement tests for the guard page
implementation.

We start by implementing basic tests asserting that guard pages can be
installed and removed, and that touching guard pages results in SIGSEGV. We
also assert that, when removing guard pages from a range, non-guard pages
remain intact.

We then examine how different operations behave on regions containing guard
markers, to ensure correct behaviour:

* Operations over multiple VMAs operate as expected.
* Invoking MADV_GUARD_INSTALL / MADV_GUARD_REMOVE via process_madvise() in
  batches works correctly.
* Ensuring that munmap() correctly tears down guard markers.
* Using mprotect() to adjust protection bits does not in any way override
  or cause issues with guard markers.
* Ensuring that splitting and merging VMAs around guard markers causes no
  issue - i.e. that a marker which 'belongs' to one VMA can function just
  as well 'belonging' to another.
* Ensuring that madvise(..., MADV_DONTNEED) and madvise(..., MADV_FREE)
  do not remove guard markers.
* Ensuring that mlock()'ing a range containing guard markers does not
  cause issues.
* Ensuring that mremap() can move a guard range and retain guard markers.
* Ensuring that mremap() can expand a guard range and retain guard
  markers (perhaps moving the range).
* Ensuring that mremap() can shrink a guard range and retain guard markers.
* Ensuring that forking a process correctly retains guard markers.
* Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
* Ensuring that lazyfree (MADV_FREE) leaves guard markers in place.
* Ensuring that userfaultfd can co-exist with guard pages.
* Ensuring that madvise(..., MADV_POPULATE_READ) and
  madvise(..., MADV_POPULATE_WRITE) error out when encountering
  guard markers.
* Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do
  not remove guard markers.

If any test cannot be run due to lack of permissions, that test is skipped.
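
For reference, the core userland pattern these tests exercise is roughly the
below (a minimal sketch only - page_size is the system page size, the
MADV_GUARD_* values are those from the uapi header update in patch 4/5 since
libc headers will not yet define them, and the sigsetjmp()-based SIGSEGV
recovery the harness uses is omitted):

	#ifndef MADV_GUARD_INSTALL
	#define MADV_GUARD_INSTALL 102
	#define MADV_GUARD_REMOVE 103
	#endif

	char *ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
			 MAP_ANON | MAP_PRIVATE, -1, 0);

	/* Make the first page a guard page - any access raises SIGSEGV. */
	madvise(ptr, page_size, MADV_GUARD_INSTALL);

	/* Non-destructively remove the marker again. */
	madvise(ptr, page_size, MADV_GUARD_REMOVE);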

Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/selftests/mm/.gitignore    |    1 +
 tools/testing/selftests/mm/Makefile      |    1 +
 tools/testing/selftests/mm/guard-pages.c | 1239 ++++++++++++++++++++++
 3 files changed, 1241 insertions(+)
 create mode 100644 tools/testing/selftests/mm/guard-pages.c

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index 689bbd520296..8f01f4da1c0d 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -54,3 +54,4 @@ droppable
 hugetlb_dio
 pkey_sighandler_tests_32
 pkey_sighandler_tests_64
+guard-pages
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 02e1204971b0..15c734d6cfec 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv
 TEST_GEN_FILES += hugetlb_madv_vs_map
 TEST_GEN_FILES += hugetlb_dio
 TEST_GEN_FILES += droppable
+TEST_GEN_FILES += guard-pages
 
 ifneq ($(ARCH),arm64)
 TEST_GEN_FILES += soft-dirty
diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c
new file mode 100644
index 000000000000..7db9c913e9db
--- /dev/null
+++ b/tools/testing/selftests/mm/guard-pages.c
@@ -0,0 +1,1239 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#define _GNU_SOURCE
+#include "../kselftest_harness.h"
+#include <asm-generic/mman.h> /* Force the import of the tools version. */
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/userfaultfd.h>
+#include <setjmp.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <sys/uio.h>
+#include <unistd.h>
+
+/*
+ * Ignore the checkpatch warning, as per the C99 standard, section 7.14.1.1:
+ *
+ * "If the signal occurs other than as the result of calling the abort or raise
+ *  function, the behavior is undefined if the signal handler refers to any
+ *  object with static storage duration other than by assigning a value to an
+ *  object declared as volatile sig_atomic_t"
+ */
+static volatile sig_atomic_t signal_jump_set;
+static sigjmp_buf signal_jmp_buf;
+
+/*
+ * Ignore the checkpatch warning, we must read from x but don't want to do
+ * anything with it in order to trigger a read page fault. We therefore must use
+ * volatile to stop the compiler from optimising this away.
+ */
+#define FORCE_READ(x) (*(volatile typeof(x) *)x)
+
+static int userfaultfd(int flags)
+{
+	return syscall(SYS_userfaultfd, flags);
+}
+
+static void handle_fatal(int c)
+{
+	if (!signal_jump_set)
+		return;
+
+	siglongjmp(signal_jmp_buf, c);
+}
+
+static int pidfd_open(pid_t pid, unsigned int flags)
+{
+	return syscall(SYS_pidfd_open, pid, flags);
+}
+
+/*
+ * Enable our signal catcher and try to read/write the specified buffer. The
+ * return value indicates whether the read/write succeeds without a fatal
+ * signal.
+ */
+static bool try_access_buf(char *ptr, bool write)
+{
+	bool failed;
+
+	/* Tell signal handler to jump back here on fatal signal. */
+	signal_jump_set = true;
+	/* If a fatal signal arose, we will jump back here and failed is set. */
+	failed = sigsetjmp(signal_jmp_buf, 0) != 0;
+
+	if (!failed) {
+		if (write)
+			*ptr = 'x';
+		else
+			FORCE_READ(ptr);
+	}
+
+	signal_jump_set = false;
+	return !failed;
+}
+
+/* Try and read from a buffer, return true if no fatal signal. */
+static bool try_read_buf(char *ptr)
+{
+	return try_access_buf(ptr, false);
+}
+
+/* Try and write to a buffer, return true if no fatal signal. */
+static bool try_write_buf(char *ptr)
+{
+	return try_access_buf(ptr, true);
+}
+
+/*
+ * Try and BOTH read from AND write to a buffer, return true if BOTH operations
+ * succeed.
+ */
+static bool try_read_write_buf(char *ptr)
+{
+	return try_read_buf(ptr) && try_write_buf(ptr);
+}
+
+FIXTURE(guard_pages)
+{
+	unsigned long page_size;
+};
+
+FIXTURE_SETUP(guard_pages)
+{
+	struct sigaction act = {
+		.sa_handler = &handle_fatal,
+		.sa_flags = SA_NODEFER,
+	};
+
+	sigemptyset(&act.sa_mask);
+	if (sigaction(SIGSEGV, &act, NULL))
+		ksft_exit_fail_perror("sigaction");
+
+	self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
+};
+
+FIXTURE_TEARDOWN(guard_pages)
+{
+	struct sigaction act = {
+		.sa_handler = SIG_DFL,
+		.sa_flags = SA_NODEFER,
+	};
+
+	sigemptyset(&act.sa_mask);
+	sigaction(SIGSEGV, &act, NULL);
+}
+
+TEST_F(guard_pages, basic)
+{
+	const unsigned long NUM_PAGES = 10;
+	const unsigned long page_size = self->page_size;
+	char *ptr;
+	int i;
+
+	ptr = mmap(NULL, NUM_PAGES * page_size, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE | MAP_ANON, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Trivially assert we can touch the first page. */
+	ASSERT_TRUE(try_read_write_buf(ptr));
+
+	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
+
+	/* Establish that 1st page SIGSEGV's. */
+	ASSERT_FALSE(try_read_write_buf(ptr));
+
+	/* Ensure we can touch everything else.*/
+	for (i = 1; i < NUM_PAGES; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_TRUE(try_read_write_buf(curr));
+	}
+
+	/* Establish a guard page at the end of the mapping. */
+	ASSERT_EQ(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size,
+			  MADV_GUARD_INSTALL), 0);
+
+	/* Check that both guard pages result in SIGSEGV. */
+	ASSERT_FALSE(try_read_write_buf(ptr));
+	ASSERT_FALSE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size]));
+
+	/* Remove the first guard page. */
+	ASSERT_FALSE(madvise(ptr, page_size, MADV_GUARD_REMOVE));
+
+	/* Make sure we can touch it. */
+	ASSERT_TRUE(try_read_write_buf(ptr));
+
+	/* Remove the last guard page. */
+	ASSERT_FALSE(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size,
+			     MADV_GUARD_REMOVE));
+
+	/* Make sure we can touch it. */
+	ASSERT_TRUE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size]));
+
+	/*
+	 * Test setting a _range_ of pages, namely the first 3. The first of
+	 * these will already have been faulted in, so this also tests that we
+	 * can install guard pages over backed pages.
+	 */
+	ASSERT_EQ(madvise(ptr, 3 * page_size, MADV_GUARD_INSTALL), 0);
+
+	/* Make sure they are all guard pages. */
+	for (i = 0; i < 3; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+
+	/* Make sure the rest are not. */
+	for (i = 3; i < NUM_PAGES; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_TRUE(try_read_write_buf(curr));
+	}
+
+	/* Remove guard pages. */
+	ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_REMOVE), 0);
+
+	/* Now make sure we can touch everything. */
+	for (i = 0; i < NUM_PAGES; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_TRUE(try_read_write_buf(curr));
+	}
+
+	/*
+	 * Now remove all guard pages, make sure we don't remove existing
+	 * entries.
+	 */
+	ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_REMOVE), 0);
+
+	for (i = 0; i < NUM_PAGES * page_size; i += page_size) {
+		char chr = ptr[i];
+
+		ASSERT_EQ(chr, 'x');
+	}
+
+	ASSERT_EQ(munmap(ptr, NUM_PAGES * page_size), 0);
+}
+
+/* Assert that operations applied across multiple VMAs work as expected. */
+TEST_F(guard_pages, multi_vma)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr_region, *ptr, *ptr1, *ptr2, *ptr3;
+	int i;
+
+	/* Reserve a 100 page region over which we can install VMAs. */
+	ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
+			  MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr_region, MAP_FAILED);
+
+	/* Place a VMA of 10 pages size at the start of the region. */
+	ptr1 = mmap(ptr_region, 10 * page_size, PROT_READ | PROT_WRITE,
+		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr1, MAP_FAILED);
+
+	/* Place a VMA of 5 pages size 50 pages into the region. */
+	ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size,
+		    PROT_READ | PROT_WRITE,
+		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/* Place a VMA of 20 pages size at the end of the region. */
+	ptr3 = mmap(&ptr_region[80 * page_size], 20 * page_size,
+		    PROT_READ | PROT_WRITE,
+		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr3, MAP_FAILED);
+
+	/* Unmap gaps. */
+	ASSERT_EQ(munmap(&ptr_region[10 * page_size], 40 * page_size), 0);
+	ASSERT_EQ(munmap(&ptr_region[55 * page_size], 25 * page_size), 0);
+
+	/*
+	 * We end up with VMAs like this:
+	 *
+	 * 0    10 .. 50   55 .. 80   100
+	 * [---]      [---]      [---]
+	 */
+
+	/*
+	 * Now mark the whole range as guard pages and make sure all VMAs are as
+	 * such.
+	 */
+
+	/*
+	 * madvise() is certifiable and lets you perform operations over gaps,
+	 * everything works, but it indicates an error and errno is set to
+	 * ENOMEM. Also if anything runs out of memory errno is set to
+	 * ENOMEM. You are meant to guess which is which.
+	 */
+	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_INSTALL), -1);
+	ASSERT_EQ(errno, ENOMEM);
+
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr1[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+
+	for (i = 0; i < 5; i++) {
+		char *curr = &ptr2[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+
+	for (i = 0; i < 20; i++) {
+		char *curr = &ptr3[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+
+	/* Now remove guard pages over the range and assert the opposite. */
+
+	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_REMOVE), -1);
+	ASSERT_EQ(errno, ENOMEM);
+
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr1[i * page_size];
+
+		ASSERT_TRUE(try_read_write_buf(curr));
+	}
+
+	for (i = 0; i < 5; i++) {
+		char *curr = &ptr2[i * page_size];
+
+		ASSERT_TRUE(try_read_write_buf(curr));
+	}
+
+	for (i = 0; i < 20; i++) {
+		char *curr = &ptr3[i * page_size];
+
+		ASSERT_TRUE(try_read_write_buf(curr));
+	}
+
+	/* Now map incompatible VMAs in the gaps. */
+	ptr = mmap(&ptr_region[10 * page_size], 40 * page_size,
+		   PROT_READ | PROT_WRITE | PROT_EXEC,
+		   MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	ptr = mmap(&ptr_region[55 * page_size], 25 * page_size,
+		   PROT_READ | PROT_WRITE | PROT_EXEC,
+		   MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/*
+	 * We end up with VMAs like this:
+	 *
+	 * 0    10 .. 50   55 .. 80   100
+	 * [---][xxxx][---][xxxx][---]
+	 *
+	 * Where 'x' signifies VMAs that cannot be merged with those adjacent to
+	 * them.
+	 */
+
+	/* Multiple VMAs adjacent to one another should result in no error. */
+	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_INSTALL), 0);
+	for (i = 0; i < 100; i++) {
+		char *curr = &ptr_region[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_REMOVE), 0);
+	for (i = 0; i < 100; i++) {
+		char *curr = &ptr_region[i * page_size];
+
+		ASSERT_TRUE(try_read_write_buf(curr));
+	}
+
+	/* Cleanup. */
+	ASSERT_EQ(munmap(ptr_region, 100 * page_size), 0);
+}
+
+/*
+ * Assert that batched operations performed using process_madvise() work as
+ * expected.
+ */
+TEST_F(guard_pages, process_madvise)
+{
+	const unsigned long page_size = self->page_size;
+	pid_t pid = getpid();
+	int pidfd = pidfd_open(pid, 0);
+	char *ptr_region, *ptr1, *ptr2, *ptr3;
+	ssize_t count;
+	struct iovec vec[6];
+
+	ASSERT_NE(pidfd, -1);
+
+	/* Reserve region to map over. */
+	ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
+			  MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr_region, MAP_FAILED);
+
+	/* 10 pages offset 1 page into reserve region. */
+	ptr1 = mmap(&ptr_region[page_size], 10 * page_size,
+		    PROT_READ | PROT_WRITE,
+		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr1, MAP_FAILED);
+	/* We want guard markers at start/end of each VMA. */
+	vec[0].iov_base = ptr1;
+	vec[0].iov_len = page_size;
+	vec[1].iov_base = &ptr1[9 * page_size];
+	vec[1].iov_len = page_size;
+
+	/* 5 pages offset 50 pages into reserve region. */
+	ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size,
+		    PROT_READ | PROT_WRITE,
+		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr2, MAP_FAILED);
+	vec[2].iov_base = ptr2;
+	vec[2].iov_len = page_size;
+	vec[3].iov_base = &ptr2[4 * page_size];
+	vec[3].iov_len = page_size;
+
+	/* 20 pages offset 79 pages into reserve region. */
+	ptr3 = mmap(&ptr_region[79 * page_size], 20 * page_size,
+		    PROT_READ | PROT_WRITE,
+		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr3, MAP_FAILED);
+	vec[4].iov_base = ptr3;
+	vec[4].iov_len = page_size;
+	vec[5].iov_base = &ptr3[19 * page_size];
+	vec[5].iov_len = page_size;
+
+	/* Free surrounding VMAs. */
+	ASSERT_EQ(munmap(ptr_region, page_size), 0);
+	ASSERT_EQ(munmap(&ptr_region[11 * page_size], 39 * page_size), 0);
+	ASSERT_EQ(munmap(&ptr_region[55 * page_size], 24 * page_size), 0);
+	ASSERT_EQ(munmap(&ptr_region[99 * page_size], page_size), 0);
+
+	/* Now guard in one step. */
+	count = process_madvise(pidfd, vec, 6, MADV_GUARD_INSTALL, 0);
+
+	/* OK we don't have permission to do this, skip. */
+	if (count == -1 && errno == EPERM)
+		ksft_exit_skip("No process_madvise() permissions, try running as root.\n");
+
+	/* Returns the number of bytes advised. */
+	ASSERT_EQ(count, 6 * page_size);
+
+	/* Now make sure the guarding was applied. */
+
+	ASSERT_FALSE(try_read_write_buf(ptr1));
+	ASSERT_FALSE(try_read_write_buf(&ptr1[9 * page_size]));
+
+	ASSERT_FALSE(try_read_write_buf(ptr2));
+	ASSERT_FALSE(try_read_write_buf(&ptr2[4 * page_size]));
+
+	ASSERT_FALSE(try_read_write_buf(ptr3));
+	ASSERT_FALSE(try_read_write_buf(&ptr3[19 * page_size]));
+
+	/* Now do the same with unguard... */
+	count = process_madvise(pidfd, vec, 6, MADV_GUARD_REMOVE, 0);
+
+	/* ...and everything should now succeed. */
+
+	ASSERT_TRUE(try_read_write_buf(ptr1));
+	ASSERT_TRUE(try_read_write_buf(&ptr1[9 * page_size]));
+
+	ASSERT_TRUE(try_read_write_buf(ptr2));
+	ASSERT_TRUE(try_read_write_buf(&ptr2[4 * page_size]));
+
+	ASSERT_TRUE(try_read_write_buf(ptr3));
+	ASSERT_TRUE(try_read_write_buf(&ptr3[19 * page_size]));
+
+	/* Cleanup. */
+	ASSERT_EQ(munmap(ptr1, 10 * page_size), 0);
+	ASSERT_EQ(munmap(ptr2, 5 * page_size), 0);
+	ASSERT_EQ(munmap(ptr3, 20 * page_size), 0);
+	close(pidfd);
+}
+
+/* Assert that unmapping ranges does not leave guard markers behind. */
+TEST_F(guard_pages, munmap)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr, *ptr_new1, *ptr_new2;
+
+	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Guard first and last pages. */
+	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
+	ASSERT_EQ(madvise(&ptr[9 * page_size], page_size, MADV_GUARD_INSTALL), 0);
+
+	/* Assert that they are guarded. */
+	ASSERT_FALSE(try_read_write_buf(ptr));
+	ASSERT_FALSE(try_read_write_buf(&ptr[9 * page_size]));
+
+	/* Unmap them. */
+	ASSERT_EQ(munmap(ptr, page_size), 0);
+	ASSERT_EQ(munmap(&ptr[9 * page_size], page_size), 0);
+
+	/* Map over them.*/
+	ptr_new1 = mmap(ptr, page_size, PROT_READ | PROT_WRITE,
+			MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr_new1, MAP_FAILED);
+	ptr_new2 = mmap(&ptr[9 * page_size], page_size, PROT_READ | PROT_WRITE,
+			MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr_new2, MAP_FAILED);
+
+	/* Assert that they are now not guarded. */
+	ASSERT_TRUE(try_read_write_buf(ptr_new1));
+	ASSERT_TRUE(try_read_write_buf(ptr_new2));
+
+	/* Cleanup. */
+	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+/* Assert that mprotect() operations have no bearing on guard markers. */
+TEST_F(guard_pages, mprotect)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr;
+	int i;
+
+	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Guard the middle of the range. */
+	ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size,
+			  MADV_GUARD_INSTALL), 0);
+
+	/* Assert that it is indeed guarded. */
+	ASSERT_FALSE(try_read_write_buf(&ptr[5 * page_size]));
+	ASSERT_FALSE(try_read_write_buf(&ptr[6 * page_size]));
+
+	/* Now make these pages read-only. */
+	ASSERT_EQ(mprotect(&ptr[5 * page_size], 2 * page_size, PROT_READ), 0);
+
+	/* Make sure the range is still guarded. */
+	ASSERT_FALSE(try_read_buf(&ptr[5 * page_size]));
+	ASSERT_FALSE(try_read_buf(&ptr[6 * page_size]));
+
+	/* Make sure we can guard again without issue.*/
+	ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size,
+			  MADV_GUARD_INSTALL), 0);
+
+	/* Make sure the range is, yet again, still guarded. */
+	ASSERT_FALSE(try_read_buf(&ptr[5 * page_size]));
+	ASSERT_FALSE(try_read_buf(&ptr[6 * page_size]));
+
+	/* Now unguard the whole range. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0);
+
+	/* Make sure the whole range is readable. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_TRUE(try_read_buf(curr));
+	}
+
+	/* Cleanup. */
+	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+/* Split and merge VMAs and make sure guard pages still behave. */
+TEST_F(guard_pages, split_merge)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr, *ptr_new;
+	int i;
+
+	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Guard the whole range. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), 0);
+
+	/* Make sure the whole range is guarded. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+
+	/* Now unmap some pages in the range so we split. */
+	ASSERT_EQ(munmap(&ptr[2 * page_size], page_size), 0);
+	ASSERT_EQ(munmap(&ptr[5 * page_size], page_size), 0);
+	ASSERT_EQ(munmap(&ptr[8 * page_size], page_size), 0);
+
+	/* Make sure the remaining ranges are guarded post-split. */
+	for (i = 0; i < 2; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+	for (i = 2; i < 5; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+	for (i = 6; i < 8; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+	for (i = 9; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+
+	/* Now map them again - the unmap will have cleared the guards. */
+	ptr_new = mmap(&ptr[2 * page_size], page_size, PROT_READ | PROT_WRITE,
+		       MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr_new, MAP_FAILED);
+	ptr_new = mmap(&ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE,
+		       MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr_new, MAP_FAILED);
+	ptr_new = mmap(&ptr[8 * page_size], page_size, PROT_READ | PROT_WRITE,
+		       MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr_new, MAP_FAILED);
+
+	/* Now make sure guard pages are established. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+		bool result = try_read_write_buf(curr);
+		bool expect_true = i == 2 || i == 5 || i == 8;
+
+		ASSERT_TRUE(expect_true ? result : !result);
+	}
+
+	/* Now guard everything again. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), 0);
+
+	/* Make sure the whole range is guarded. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+
+	/* Now split the range into three. */
+	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0);
+	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0);
+
+	/* Make sure the whole range is guarded for read. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_buf(curr));
+	}
+
+	/* Now reset protection bits so we merge the whole thing. */
+	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0);
+	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size,
+			   PROT_READ | PROT_WRITE), 0);
+
+	/* Make sure the whole range is still guarded. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+
+	/* Split range into 3 again... */
+	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0);
+	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0);
+
+	/* ...and unguard the whole range. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0);
+
+	/* Make sure the whole range is remedied for read. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_TRUE(try_read_buf(curr));
+	}
+
+	/* Merge them again. */
+	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0);
+	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size,
+			   PROT_READ | PROT_WRITE), 0);
+
+	/* Now ensure the merged range is remedied for read/write. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_TRUE(try_read_write_buf(curr));
+	}
+
+	/* Cleanup. */
+	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+/* Assert that MADV_DONTNEED does not remove guard markers. */
+TEST_F(guard_pages, dontneed)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr;
+	int i;
+
+	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Back the whole range. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		*curr = 'y';
+	}
+
+	/* Guard every other page. */
+	for (i = 0; i < 10; i += 2) {
+		char *curr = &ptr[i * page_size];
+		int res = madvise(curr, page_size, MADV_GUARD_INSTALL);
+
+		ASSERT_EQ(res, 0);
+	}
+
+	/* Indicate that we don't need any of the range. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_DONTNEED), 0);
+
+	/* Check to ensure guard markers are still in place. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+		bool result = try_read_buf(curr);
+
+		if (i % 2 == 0) {
+			ASSERT_FALSE(result);
+		} else {
+			ASSERT_TRUE(result);
+			/* Make sure we really did get reset to zero page. */
+			ASSERT_EQ(*curr, '\0');
+		}
+
+		/* Now write... */
+		result = try_write_buf(&ptr[i * page_size]);
+
+		/* ...and make sure same result. */
+		ASSERT_TRUE(i % 2 != 0 ? result : !result);
+	}
+
+	/* Cleanup. */
+	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+/* Assert that mlock()'ed pages work correctly with guard markers. */
+TEST_F(guard_pages, mlock)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr;
+	int i;
+
+	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Populate. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		*curr = 'y';
+	}
+
+	/* Lock. */
+	ASSERT_EQ(mlock(ptr, 10 * page_size), 0);
+
+	/* Now try to guard, should fail with EINVAL. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), -1);
+	ASSERT_EQ(errno, EINVAL);
+
+	/* OK unlock. */
+	ASSERT_EQ(munlock(ptr, 10 * page_size), 0);
+
+	/* Guard first half of range, should now succeed. */
+	ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_INSTALL), 0);
+
+	/* Make sure guard works. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+		bool result = try_read_write_buf(curr);
+
+		if (i < 5) {
+			ASSERT_FALSE(result);
+		} else {
+			ASSERT_TRUE(result);
+			ASSERT_EQ(*curr, 'x');
+		}
+	}
+
+	/*
+	 * Now lock the latter part of the range. We can't lock the guard pages,
+	 * as this would result in the pages being populated and the guarding
+	 * would cause this to error out.
+	 */
+	ASSERT_EQ(mlock(&ptr[5 * page_size], 5 * page_size), 0);
+
+	/*
+	 * Now remove guard pages, we permit mlock()'d ranges to have guard
+	 * pages removed as it is a non-destructive operation.
+	 */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0);
+
+	/* Now check that no guard pages remain. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_TRUE(try_read_write_buf(curr));
+	}
+
+	/* Cleanup. */
+	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+/*
+ * Assert that moving, extending and shrinking memory via mremap() retains
+ * guard markers where possible.
+ *
+ * - Moving a mapping alone should retain markers as they are.
+ */
+TEST_F(guard_pages, mremap_move)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr, *ptr_new;
+
+	/* Map 5 pages. */
+	ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Place guard markers at both ends of the 5 page span. */
+	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
+	ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_INSTALL), 0);
+
+	/* Make sure the guard pages are in effect. */
+	ASSERT_FALSE(try_read_write_buf(ptr));
+	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
+
+	/* Map a new region we will move this range into. Doing this ensures
+	 * that we have reserved a range to map into.
+	 */
+	ptr_new = mmap(NULL, 5 * page_size, PROT_NONE, MAP_ANON | MAP_PRIVATE,
+		       -1, 0);
+	ASSERT_NE(ptr_new, MAP_FAILED);
+
+	ASSERT_EQ(mremap(ptr, 5 * page_size, 5 * page_size,
+			 MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new), ptr_new);
+
+	/* Make sure the guard markers are retained. */
+	ASSERT_FALSE(try_read_write_buf(ptr_new));
+	ASSERT_FALSE(try_read_write_buf(&ptr_new[4 * page_size]));
+
+	/*
+	 * Clean up - we need only reference the new pointer as we overwrote the
+	 * PROT_NONE range and moved the existing one.
+	 */
+	munmap(ptr_new, 5 * page_size);
+}
+
+/*
+ * Assert that moving, extending and shrinking memory via mremap() retains
+ * guard markers where possible.
+ *
+ * Expanding should retain guard pages, only now in a different position. The user
+ * will have to remove guard pages manually to fix up (they'd have to do the
+ * same if it were a PROT_NONE mapping).
+ */
+TEST_F(guard_pages, mremap_expand)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr, *ptr_new;
+
+	/* Map 10 pages... */
+	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+	/* ...But unmap the last 5 so we can ensure we can expand into them. */
+	ASSERT_EQ(munmap(&ptr[5 * page_size], 5 * page_size), 0);
+
+	/* Place guard markers at both ends of the 5 page span. */
+	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
+	ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_INSTALL), 0);
+
+	/* Make sure the guarding is in effect. */
+	ASSERT_FALSE(try_read_write_buf(ptr));
+	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
+
+	/* Now expand to 10 pages. */
+	ptr = mremap(ptr, 5 * page_size, 10 * page_size, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/*
+	 * Make sure the guard markers are retained in their original positions.
+	 */
+	ASSERT_FALSE(try_read_write_buf(ptr));
+	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
+
+	/* Reserve a region which we can move to and expand into. */
+	ptr_new = mmap(NULL, 20 * page_size, PROT_NONE,
+		       MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr_new, MAP_FAILED);
+
+	/* Now move and expand into it. */
+	ptr = mremap(ptr, 10 * page_size, 20 * page_size,
+		     MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new);
+	ASSERT_EQ(ptr, ptr_new);
+
+	/*
+	 * Again, make sure the guard markers are retained in their original positions.
+	 */
+	ASSERT_FALSE(try_read_write_buf(ptr));
+	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
+
+	/*
+	 * A real user would have to remove guard markers, but would reasonably
+	 * expect all characteristics of the mapping to be retained, including
+	 * guard markers.
+	 */
+
+	/* Cleanup. */
+	munmap(ptr, 20 * page_size);
+}
+/*
+ * Assert that moving, extending and shrinking memory via mremap() retains
+ * guard markers where possible.
+ *
+ * Shrinking will result in markers that are shrunk over being removed. Again,
+ * if the user were using a PROT_NONE mapping they'd have to manually fix this
+ * up also so this is OK.
+ */
+TEST_F(guard_pages, mremap_shrink)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr;
+	int i;
+
+	/* Map 5 pages. */
+	ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Place guard markers at both ends of the 5 page span. */
+	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
+	ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_INSTALL), 0);
+
+	/* Make sure the guarding is in effect. */
+	ASSERT_FALSE(try_read_write_buf(ptr));
+	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
+
+	/* Now shrink to 3 pages. */
+	ptr = mremap(ptr, 5 * page_size, 3 * page_size, MREMAP_MAYMOVE);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* We expect the guard marker at the start to be retained... */
+	ASSERT_FALSE(try_read_write_buf(ptr));
+
+	/* ...But remaining pages will not have guard markers. */
+	for (i = 1; i < 3; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_TRUE(try_read_write_buf(curr));
+	}
+
+	/*
+	 * As with expansion, a real user would have to remove guard pages and
+	 * fixup. But you'd have to do similar manual things with PROT_NONE
+	 * mappings too.
+	 */
+
+	/*
+	 * If we expand back to the original size, the end marker will, of
+	 * course, no longer be present.
+	 */
+	ptr = mremap(ptr, 3 * page_size, 5 * page_size, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Again, we expect the guard marker at the start to be retained... */
+	ASSERT_FALSE(try_read_write_buf(ptr));
+
+	/* ...But remaining pages will not have guard markers. */
+	for (i = 1; i < 5; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_TRUE(try_read_write_buf(curr));
+	}
+
+	/* Cleanup. */
+	munmap(ptr, 5 * page_size);
+}
+
+/*
+ * Assert that, when a process forks, VMAs that do not have VM_WIPEONFORK set
+ * retain guard pages in the child.
+ */
+TEST_F(guard_pages, fork)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr;
+	pid_t pid;
+	int i;
+
+	/* Map 10 pages. */
+	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Establish guard pages in the first 5 pages. */
+	ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_INSTALL), 0);
+
+	pid = fork();
+	ASSERT_NE(pid, -1);
+	if (!pid) {
+		/* This is the child process now. */
+
+		/* Assert that the guarding is in effect. */
+		for (i = 0; i < 10; i++) {
+			char *curr = &ptr[i * page_size];
+			bool result = try_read_write_buf(curr);
+
+			ASSERT_TRUE(i >= 5 ? result : !result);
+		}
+
+		/* Now unguard the range.*/
+		ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0);
+
+		exit(0);
+	}
+
+	/* Parent process. */
+
+	/* Parent simply waits on child. */
+	waitpid(pid, NULL, 0);
+
+	/* Child unguard does not impact parent page table state. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+		bool result = try_read_write_buf(curr);
+
+		ASSERT_TRUE(i >= 5 ? result : !result);
+	}
+
+	/* Cleanup. */
+	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+/*
+ * Assert that, when a process forks, VMAs that do have VM_WIPEONFORK set
+ * behave as expected (i.e. guard markers are wiped in the child).
+ */
+TEST_F(guard_pages, fork_wipeonfork)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr;
+	pid_t pid;
+	int i;
+
+	/* Map 10 pages. */
+	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Mark wipe on fork. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_WIPEONFORK), 0);
+
+	/* Guard the first 5 pages. */
+	ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_INSTALL), 0);
+
+	pid = fork();
+	ASSERT_NE(pid, -1);
+	if (!pid) {
+		/* This is the child process now. */
+
+		/* Guard will have been wiped. */
+		for (i = 0; i < 10; i++) {
+			char *curr = &ptr[i * page_size];
+
+			ASSERT_TRUE(try_read_write_buf(curr));
+		}
+
+		exit(0);
+	}
+
+	/* Parent process. */
+
+	waitpid(pid, NULL, 0);
+
+	/* Guard markers should be in effect.*/
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+		bool result = try_read_write_buf(curr);
+
+		ASSERT_TRUE(i >= 5 ? result : !result);
+	}
+
+	/* Cleanup. */
+	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+/* Ensure that MADV_FREE retains guard entries as expected. */
+TEST_F(guard_pages, lazyfree)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr;
+	int i;
+
+	/* Map 10 pages. */
+	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Guard range. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), 0);
+
+	/* Ensure guarded. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+
+	/* Lazyfree range. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_FREE), 0);
+
+	/* This should leave the guard markers in place. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+
+	/* Cleanup. */
+	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+/* Ensure that MADV_POPULATE_READ, MADV_POPULATE_WRITE behave as expected. */
+TEST_F(guard_pages, populate)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr;
+
+	/* Map 10 pages. */
+	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Guard range. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), 0);
+
+	/* Populate read should error out... */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_READ), -1);
+	ASSERT_EQ(errno, EFAULT);
+
+	/* ...as should populate write. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_WRITE), -1);
+	ASSERT_EQ(errno, EFAULT);
+
+	/* Cleanup. */
+	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+/* Ensure that MADV_COLD, MADV_PAGEOUT do not remove guard markers. */
+TEST_F(guard_pages, cold_pageout)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr;
+	int i;
+
+	/* Map 10 pages. */
+	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Guard range. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), 0);
+
+	/* Ensure guarded. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+
+	/* Now mark cold. This should have no impact on guard markers. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_COLD), 0);
+
+	/* Should remain guarded. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+
+	/* OK, now page out. This should equally have no effect on markers. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_PAGEOUT), 0);
+
+	/* Should remain guarded. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+
+	/* Cleanup. */
+	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+/* Ensure that guard pages do not break userfaultfd. */
+TEST_F(guard_pages, uffd)
+{
+	const unsigned long page_size = self->page_size;
+	int uffd;
+	char *ptr;
+	int i;
+	struct uffdio_api api = {
+		.api = UFFD_API,
+		.features = 0,
+	};
+	struct uffdio_register reg;
+	struct uffdio_range range;
+
+	/* Set up uffd. */
+	uffd = userfaultfd(0);
+	if (uffd == -1 && errno == EPERM)
+		ksft_exit_skip("No userfaultfd permissions, try running as root.\n");
+	ASSERT_NE(uffd, -1);
+
+	ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
+
+	/* Map 10 pages. */
+	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
+		   MAP_ANON | MAP_PRIVATE, -1, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Register the range with uffd. */
+	range.start = (unsigned long)ptr;
+	range.len = 10 * page_size;
+	reg.range = range;
+	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
+	ASSERT_EQ(ioctl(uffd, UFFDIO_REGISTER, &reg), 0);
+
+	/* Guard the range. This should not trigger the uffd. */
+	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), 0);
+
+	/* The guarding should behave as usual with no uffd intervention. */
+	for (i = 0; i < 10; i++) {
+		char *curr = &ptr[i * page_size];
+
+		ASSERT_FALSE(try_read_write_buf(curr));
+	}
+
+	/* Cleanup. */
+	ASSERT_EQ(ioctl(uffd, UFFDIO_UNREGISTER, &range), 0);
+	close(uffd);
+	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
+TEST_HARNESS_MAIN
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs
  2024-10-23 16:24 ` [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs Lorenzo Stoakes
@ 2024-10-23 23:04   ` Andrew Morton
  2024-10-24  7:34     ` Lorenzo Stoakes
  2024-10-25 18:13   ` Vlastimil Babka
  2024-10-28 20:29   ` Jarkko Sakkinen
  2 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2024-10-23 23:04 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Wed, 23 Oct 2024 17:24:38 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> 
> ...
>
> Existing mechanism for performing a walk which also installs page table
> entries if necessary are heavily duplicated throughout the kernel,

How complicated is it to migrate those to use this?

> ...
>
> We explicitly do not allow the existing page walk API to expose this
> feature as it is dangerous and intended for internal mm use only. Therefore
> we provide a new walk_page_range_mm() function exposed only to
> mm/internal.h.
>

Is this decision compatible with the above migrations?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism
  2024-10-23 16:24 ` [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism Lorenzo Stoakes
@ 2024-10-23 23:12   ` Andrew Morton
  2024-10-24  7:25     ` Lorenzo Stoakes
  2024-10-25 17:12   ` Lorenzo Stoakes
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2024-10-23 23:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Wed, 23 Oct 2024 17:24:40 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> We permit this operation on anonymous memory only

What is the reason for this restriction?

(generally, it would be nice to include the proposed manpage update at
this time, so people can review it while the code change is fresh in
their minds)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism
  2024-10-23 23:12   ` Andrew Morton
@ 2024-10-24  7:25     ` Lorenzo Stoakes
  2024-10-26  0:11       ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-24  7:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Wed, Oct 23, 2024 at 04:12:05PM -0700, Andrew Morton wrote:
> On Wed, 23 Oct 2024 17:24:40 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > We permit this operation on anonymous memory only
>
> What is the reason for this restriction?

At this stage it's a simplification as there are numerous issues that would
need to be overcome to support file-backed mappings.

I actually do plan to extend this work to support shmem and file-backed
mappings in the future as a revision to this work.

>
> (generally, it would be nice to include the proposed manpage update at
> this time, so people can review it while the code change is fresh in
> their minds)

It'd be nice to have the man pages live somewhere within the kernel so we
can do this as part of the patch change as things evolve during review, but
obviously moving things about like that is out of scope for this discussion
:)

I do explicitly intend to send a manpage update once this series lands
however.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs
  2024-10-23 23:04   ` Andrew Morton
@ 2024-10-24  7:34     ` Lorenzo Stoakes
  2024-10-24  7:45       ` David Hildenbrand
  0 siblings, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-24  7:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Wed, Oct 23, 2024 at 04:04:05PM -0700, Andrew Morton wrote:
> On Wed, 23 Oct 2024 17:24:38 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> >
> > ...
> >
> > Existing mechanism for performing a walk which also installs page table
> > entries if necessary are heavily duplicated throughout the kernel,
>
> How complicated is it to migrate those to use this?

I would say probably somewhat difficult as very often people are doing quite
custom things, but I will take a look at seeing if we can't make things a little
more generic.

I am also mildly motivated to look at trying to find a generic way to do
replaces...

Both on the TODO!

>
> > ...
> >
> > We explicitly do not allow the existing page walk API to expose this
> > feature as it is dangerous and intended for internal mm use only. Therefore
> > we provide a new walk_page_range_mm() function exposed only to
> > mm/internal.h.
> >
>
> Is this decision compatible with the above migrations?

Yes, nobody but mm should be doing this. You have to be enormously careful about
locks, caches, accounting and the various other machinery around actually
setting a PTE, and this allows you to establish entirely new mappings, so it
could very much be abused.
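
For context, usage within mm ends up looking roughly like the below - a sketch
only; the exact install_pte() prototype and the guard marker helper are those
defined by patches 1/5 and 2/5, so treat the names here as illustrative:

static int guard_install_pte(unsigned long addr, unsigned long next,
			     pte_t *ptep, struct mm_walk *walk)
{
	/*
	 * Install a guard marker where no mapping previously existed - per
	 * patch 1/5, no TLB maintenance is required for this.
	 */
	*ptep = make_pte_marker(PTE_MARKER_GUARD);
	return 0;
}

static const struct mm_walk_ops guard_ops = {
	.install_pte	= guard_install_pte,
};

/* Reachable only via walk_page_range_mm(), exposed through mm/internal.h. */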

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs
  2024-10-24  7:34     ` Lorenzo Stoakes
@ 2024-10-24  7:45       ` David Hildenbrand
  2024-10-24  8:07         ` Lorenzo Stoakes
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2024-10-24  7:45 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, linux-mm,
	linux-kernel, Muchun Song, Richard Henderson, Matt Turner,
	Thomas Bogendoerfer, James E . J . Bottomley, Helge Deller,
	Chris Zankel, Max Filippov, Arnd Bergmann, linux-alpha,
	linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On 24.10.24 09:34, Lorenzo Stoakes wrote:
> On Wed, Oct 23, 2024 at 04:04:05PM -0700, Andrew Morton wrote:
>> On Wed, 23 Oct 2024 17:24:38 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>>
>>>
>>> ...
>>>
>>> Existing mechanism for performing a walk which also installs page table
>>> entries if necessary are heavily duplicated throughout the kernel,
>>
>> How complicated is it to migrate those to use this?
> 
> I would say probably somewhat difficult as very often people are doing quite
> custom things, but I will take a look at seeing if we can't make things a little
> more generic.
> 
> I am also mildly motivated to look at trying to find a generic way to do
> replaces...
> 
> Both on the TODO!

I'm not super happy about extending the rusty old pagewalk API, because 
it's inefficient (indirect calls) and not future proof (batching, large 
folios).

But I see how we ended up with this patch, and it will be easy to 
convert to something better once we have it.

We already discussed in the past that we need a better and more 
efficient way to walk page tables. I have part of that on my TODO list, 
but I'm getting distracted.

*Inserting* (not walking/modifying existing things as most users do) as done
in this patch is slightly different though; likely "one thing that fits all"
will not apply to all page table walker use cases.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs
  2024-10-24  7:45       ` David Hildenbrand
@ 2024-10-24  8:07         ` Lorenzo Stoakes
  2024-10-25 19:08           ` David Hildenbrand
  0 siblings, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-24  8:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
	Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Thu, Oct 24, 2024 at 09:45:37AM +0200, David Hildenbrand wrote:
> On 24.10.24 09:34, Lorenzo Stoakes wrote:
> > On Wed, Oct 23, 2024 at 04:04:05PM -0700, Andrew Morton wrote:
> > > On Wed, 23 Oct 2024 17:24:38 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > >
> > > > ...
> > > >
> > > > Existing mechanism for performing a walk which also installs page table
> > > > entries if necessary are heavily duplicated throughout the kernel,
> > >
> > > How complicated is it to migrate those to use this?
> >
> > I would say probably somewhat difficult as very often people are doing quite
> > custom things, but I will take a look at seeing if we can't make things a little
> > more generic.
> >
> > I am also mildly motivated to look at trying to find a generic way to do
> > replaces...
> >
> > Both on the TODO!
>
> I'm not super happy about extending the rusty old pagewalk API, because it's
> inefficient (indirect calls) and not future proof (batching, large folios).

Yeah it could be improved, but I think the ideal way would be to genericise as
much as we can and 'upgrade' this logic.

>
> But I see how we ended up with this patch, and it will be easy to convert to
> something better once we have it.

Yes, I was quite happy with the ultimately small delta this all ended up being
vs. the various other alternatives (changing zap logic, introducing a custom
page fault mechanism, duplicating page walk code _yet again_, porting uffd
logic, etc. etc.)

But in an ideal world we'd have _one_ place that does this.

>
> We already discussed in the past that we need a better and more efficient
> way to walk page tables. I have part of that on my TODO list, but I'm
> getting distracted.

Yes I remember an LSF session on this, it's a really obvious area of improvement
that stands out at the moment for sure.

Having worked several 12+ hour days in a row recently, I can relate to
workload making this difficult though :)

>
> *Inserting* (not walking/modifying existing things as most users to) as done
> in this patch is slightly different though, likely "on thing that fits all"
> will not apply to all page table walker user cases.

Yeah, there are also replace scenarios which then have to do egregious amounts
of work to make sure we do everything right - in fact there are duplicates of
this in mm/madvise.c *grumble grumble*.

>
> --
> Cheers,
>
> David / dhildenb
>

OK so I guess I'll hold off on my TODOs on this as you are looking in this
area and I trust you :)

Cheers!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism
  2024-10-23 16:24 ` [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism Lorenzo Stoakes
  2024-10-23 23:12   ` Andrew Morton
@ 2024-10-25 17:12   ` Lorenzo Stoakes
  2024-10-25 21:56     ` Vlastimil Babka
  2024-10-25 21:44   ` Vlastimil Babka
  2024-10-28 20:45   ` Jarkko Sakkinen
  3 siblings, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-25 17:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Wed, Oct 23, 2024 at 05:24:40PM +0100, Lorenzo Stoakes wrote:
> Implement a new lightweight guard page feature, that is regions of userland
> virtual memory that, when accessed, cause a fatal signal to arise.

<snip>

Hi Andrew - Could you apply the below fix-patch? I realise we must handle
fatal signals and conditional rescheduling in the vector_madvise() special
case.

Thanks!

----8<----
From 546d7e1831c71599fc733d589e0d75f52e84826d Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Fri, 25 Oct 2024 18:05:48 +0100
Subject: [PATCH] mm: yield on fatal signal/cond_sched() in vector_madvise()

While we have to treat -ERESTARTNOINTR specially here as we are looping
through a vector of operations and can't simply restart the entire
operation, we mustn't hold up fatal signals or RT kernels.
---
 mm/madvise.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 48eba25e25fe..127aa5d86656 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1713,8 +1713,14 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 		 * we have already rescinded locks, it should be no problem to
 		 * simply try again.
 		 */
-		if (ret == -ERESTARTNOINTR)
+		if (ret == -ERESTARTNOINTR) {
+			if (fatal_signal_pending(current)) {
+				ret = -EINTR;
+				break;
+			}
+			cond_resched();
 			continue;
+		}
 		if (ret < 0)
 			break;
 		iov_iter_advance(iter, iter_iov_len(iter));
--
2.47.0

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs
  2024-10-23 16:24 ` [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs Lorenzo Stoakes
  2024-10-23 23:04   ` Andrew Morton
@ 2024-10-25 18:13   ` Vlastimil Babka
  2024-10-25 21:58     ` Lorenzo Stoakes
  2024-10-28 20:29   ` Jarkko Sakkinen
  2 siblings, 1 reply; 29+ messages in thread
From: Vlastimil Babka @ 2024-10-25 18:13 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Paul E . McKenney, Jann Horn, David Hildenbrand, linux-mm,
	linux-kernel, Muchun Song, Richard Henderson, Matt Turner,
	Thomas Bogendoerfer, James E . J . Bottomley, Helge Deller,
	Chris Zankel, Max Filippov, Arnd Bergmann, linux-alpha,
	linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On 10/23/24 18:24, Lorenzo Stoakes wrote:
> The existing generic pagewalk logic permits the walking of page tables,
> invoking callbacks at individual page table levels via user-provided
> mm_walk_ops callbacks.
> 
> This is useful for traversing existing page table entries, but precludes
> the ability to establish new ones.
> 
> Existing mechanism for performing a walk which also installs page table
> entries if necessary are heavily duplicated throughout the kernel, each
> with semantic differences from one another and largely unavailable for use
> elsewhere.
> 
> Rather than add yet another implementation, we extend the generic pagewalk
> logic to enable the installation of page table entries by adding a new
> install_pte() callback in mm_walk_ops. If this is specified, then upon
> encountering a missing page table entry, we allocate and install a new one
> and continue the traversal.
> 
> If a THP huge page is encountered at either the PMD or PUD level we split
> it only if there are ops->pte_entry() (or ops->pmd_entry at PUD level),
> otherwise if there is only an ops->install_pte(), we avoid the unnecessary
> split.
> 
> We do not support hugetlb at this stage.
> 
> If this function returns an error, or an allocation fails during the
> operation, we abort the operation altogether. It is up to the caller to
> deal appropriately with partially populated page table ranges.
> 
> If install_pte() is defined, the semantics of pte_entry() change - this
> callback is then only invoked if the entry already exists. This is a useful
> property, as it allows a caller to handle existing PTEs while installing
> new ones where necessary in the specified range.
> 
> If install_pte() is not defined, then there is no functional difference to
> this patch, so all existing logic will work precisely as it did before.
> 
> As we only permit the installation of PTEs where a mapping does not already
> exist there is no need for TLB management, however we do invoke
> update_mmu_cache() for architectures which require manual maintenance of
> mappings for other CPUs.
> 
> We explicitly do not allow the existing page walk API to expose this
> feature as it is dangerous and intended for internal mm use only. Therefore
> we provide a new walk_page_range_mm() function exposed only to
> mm/internal.h.
> 
> Reviewed-by: Jann Horn <jannh@google.com>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Just a small subjective suggestion in case you agree and there's a respin or
followups:

> @@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  
>  		if (walk->action == ACTION_AGAIN)
>  			goto again;
> -
> -		/*
> -		 * Check this here so we only break down trans_huge
> -		 * pages when we _need_ to
> -		 */
> -		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> -		    walk->action == ACTION_CONTINUE ||
> -		    !(ops->pte_entry))
> +		if (walk->action == ACTION_CONTINUE)
>  			continue;
> +		if (!ops->install_pte && !ops->pte_entry)
> +			continue; /* Nothing to do. */
> +		if (!ops->pte_entry && ops->install_pte &&
> +		    pmd_present(*pmd) &&
> +		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> +			continue; /* Avoid unnecessary split. */

Much better now, thanks, but maybe the last 2 parts could be:

if (!ops->pte_entry) {
	if (!ops->install_pte)
		continue; /* Nothing to do. */
	else if (pmd_present(*pmd)
		 && (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
		continue; /* Avoid unnecessary split. */
}

Or at least put !ops->pte_entry first in both conditions?

>  		if (walk->vma)
>  			split_huge_pmd(walk->vma, pmd, addr);
> +		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> +			continue; /* Nothing to do. */
>  
>  		err = walk_pte_range(pmd, addr, next, walk);
>  		if (err)
> @@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>   again:
>  		next = pud_addr_end(addr, end);
>  		if (pud_none(*pud)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __pmd_alloc(walk->mm, pud, addr);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, depth, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>  
>  		walk->action = ACTION_SUBTREE;
> @@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>  
>  		if (walk->action == ACTION_AGAIN)
>  			goto again;
> -
> -		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> -		    walk->action == ACTION_CONTINUE ||
> -		    !(ops->pmd_entry || ops->pte_entry))
> +		if (walk->action == ACTION_CONTINUE)
>  			continue;
> +		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
> +			continue;  /* Nothing to do. */
> +		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
> +		    pud_present(*pud) &&
> +		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
> +			continue; /* Avoid unnecessary split. */

Ditto.

Thanks!
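
As a rough illustration of how the new hook is meant to be consumed (a sketch
only - the example_* names are made up here, PTE_MARKER_GUARD comes from later
in this series, and walk_page_range_mm() is the mm-internal entry point from
mm/internal.h):

static int example_install_pte(unsigned long addr, unsigned long next,
			       pte_t *ptep, struct mm_walk *walk)
{
	/* The walker performs set_pte_at()/update_mmu_cache() itself; we
	 * only provide the value to place in the empty slot. */
	*ptep = make_pte_marker(PTE_MARKER_GUARD);
	return 0;
}

static const struct mm_walk_ops example_walk_ops = {
	.install_pte	= example_install_pte,
	.walk_lock	= PGWALK_WRLOCK,	/* lock mode illustrative only */
};

and then, from some mm-internal helper holding the appropriate mm locks:

	/* Walks [start, end), allocating page tables and filling any holes. */
	err = walk_page_range_mm(mm, start, end, &example_walk_ops, NULL);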


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs
  2024-10-24  8:07         ` Lorenzo Stoakes
@ 2024-10-25 19:08           ` David Hildenbrand
  2024-10-26  7:42             ` Lorenzo Stoakes
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2024-10-25 19:08 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
	Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

>>
>> We already discussed in the past that we need a better and more efficient
>> way to walk page tables. I have part of that on my TODO list, but I'm
>> getting distracted.
> 
> Yes I remember an LSF session on this, it's a really obvious area of improvement
> that stands out at the moment for sure.
> 
> Having worked several 12+ hour days in a row recently I can relate to
> workload making this difficult though :)

Yes :)

> 
>>
>> *Inserting* (not walking/modifying existing things as most users do) as done
>> in this patch is slightly different though, likely "one thing that fits all"
>> will not apply to all page table walker use cases.
> 
> Yeah, there's also replace scenarios which then have to do egregious amounts of
> work to make sure we do everything right, in fact there's duplicates of this in
> mm/madvise.c *grumble grumble*.
> 
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> OK so I guess I'll hold off my TODOs on this as you are looking in this area and
> I trust you :)

It will probably take me a while until I get to it, though. I'd focus on 
walking (and batching what we can) first, then on top modifying existing 
entries.

The "install something where there is nothing yet" (incl. populating 
fresh page tables etc.) case probably deserves a separate "walker".

If you end up having spare cycles and want to sync on a possible design 
for some part of that bigger task -- removing the old pagewalk logic -- 
please do reach out! :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism
  2024-10-23 16:24 ` [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism Lorenzo Stoakes
  2024-10-23 23:12   ` Andrew Morton
  2024-10-25 17:12   ` Lorenzo Stoakes
@ 2024-10-25 21:44   ` Vlastimil Babka
  2024-10-25 21:49     ` Lorenzo Stoakes
  2024-10-28 20:45   ` Jarkko Sakkinen
  3 siblings, 1 reply; 29+ messages in thread
From: Vlastimil Babka @ 2024-10-25 21:44 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Paul E . McKenney, Jann Horn, David Hildenbrand, linux-mm,
	linux-kernel, Muchun Song, Richard Henderson, Matt Turner,
	Thomas Bogendoerfer, James E . J . Bottomley, Helge Deller,
	Chris Zankel, Max Filippov, Arnd Bergmann, linux-alpha,
	linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On 10/23/24 18:24, Lorenzo Stoakes wrote:
> Implement a new lightweight guard page feature, that is regions of userland
> virtual memory that, when accessed, cause a fatal signal to arise.
> 
> Currently users must establish PROT_NONE ranges to achieve this.
> 
> However this is very costly memory-wise - we need a VMA for each and every
> one of these regions AND they become unmergeable with surrounding VMAs.
> 
> In addition repeated mmap() calls require repeated kernel context switches
> and contention of the mmap lock to install these ranges, potentially also
> having to unmap memory if installed over existing ranges.
> 
> The lightweight guard approach eliminates the VMA cost altogether - rather
> than establishing a PROT_NONE VMA, it operates at the level of page table
> entries - establishing PTE markers such that accesses to them cause a fault
> followed by a SIGSEGV signal being raised.
> 
> This is achieved through the PTE marker mechanism, which we have already
> extended to provide PTE_MARKER_GUARD, which we installed via the generic
> page walking logic which we have extended for this purpose.
> 
> These guard ranges are established with MADV_GUARD_INSTALL. If the range in
> which they are installed contains any existing mappings, they will be
> zapped, i.e. free the range and unmap memory (thus mimicking the behaviour
> of MADV_DONTNEED in this respect).
> 
> Any existing guard entries will be left untouched. There is therefore no
> nesting of guarded pages.
> 
> Guarded ranges are NOT cleared by MADV_DONTNEED nor MADV_FREE (in both
> instances the memory range may be reused at which point a user would expect
> guards to still be in place), but they are cleared via MADV_GUARD_REMOVE,
> process teardown or unmapping of memory ranges.
> 
> The guard property can be removed from ranges via MADV_GUARD_REMOVE. The
> ranges over which this is applied, should they contain non-guard entries,
> will be untouched, with only guard entries being cleared.
> 
> We permit this operation on anonymous memory only, and only VMAs which are
> non-special, non-huge and not mlock()'d (if we permitted this we'd have to
> drop locked pages which would be rather counterintuitive).
> 
> Racing page faults can cause repeated attempts to install guard pages that
> are interrupted, resulting in a zap, and this process can end up being
> repeated. If this happens more than would be expected in normal operation,
> we rescind locks and retry the whole thing, which avoids lock contention in
> this scenario.
> 
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Suggested-by: Jann Horn <jannh@google.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -423,6 +423,12 @@ extern unsigned long highest_memmap_pfn;
>   */
>  #define MAX_RECLAIM_RETRIES 16
>  
> +/*
> + * Maximum number of attempts we make to install guard pages before we give up
> + * and return -ERESTARTNOINTR to have userspace try again.
> + */
> +#define MAX_MADVISE_GUARD_RETRIES 3

Can't we simply put this in mm/madvise.c ? Didn't find usage elsewhere.
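
As an aside, the user-facing behaviour described in the commit message above
amounts to something like the below (a minimal sketch, error handling elided;
MADV_GUARD_INSTALL/MADV_GUARD_REMOVE are the new advice values from this
series, so the updated uapi headers are needed):

	#include <sys/mman.h>

	char *p = mmap(NULL, 16 * 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Turn the first page into a guard page - any access is now fatal. */
	madvise(p, 4096, MADV_GUARD_INSTALL);

	/* ... p + 4096 onwards behaves as ordinary anonymous memory ... */

	/* Remove the guard; the page can then be faulted in as usual. */
	madvise(p, 4096, MADV_GUARD_REMOVE);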



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism
  2024-10-25 21:44   ` Vlastimil Babka
@ 2024-10-25 21:49     ` Lorenzo Stoakes
  0 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-25 21:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
	Matthew Wilcox, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Fri, Oct 25, 2024 at 11:44:56PM +0200, Vlastimil Babka wrote:
> On 10/23/24 18:24, Lorenzo Stoakes wrote:
> > Implement a new lightweight guard page feature, that is regions of userland
> > virtual memory that, when accessed, cause a fatal signal to arise.
> >
> > Currently users must establish PROT_NONE ranges to achieve this.
> >
> > However this is very costly memory-wise - we need a VMA for each and every
> > one of these regions AND they become unmergeable with surrounding VMAs.
> >
> > In addition repeated mmap() calls require repeated kernel context switches
> > and contention of the mmap lock to install these ranges, potentially also
> > having to unmap memory if installed over existing ranges.
> >
> > The lightweight guard approach eliminates the VMA cost altogether - rather
> > than establishing a PROT_NONE VMA, it operates at the level of page table
> > entries - establishing PTE markers such that accesses to them cause a fault
> > followed by a SIGSEGV signal being raised.
> >
> > This is achieved through the PTE marker mechanism, which we have already
> > extended to provide PTE_MARKER_GUARD, which we installed via the generic
> > page walking logic which we have extended for this purpose.
> >
> > These guard ranges are established with MADV_GUARD_INSTALL. If the range in
> > which they are installed contains any existing mappings, they will be
> > zapped, i.e. free the range and unmap memory (thus mimicking the behaviour
> > of MADV_DONTNEED in this respect).
> >
> > Any existing guard entries will be left untouched. There is therefore no
> > nesting of guarded pages.
> >
> > Guarded ranges are NOT cleared by MADV_DONTNEED nor MADV_FREE (in both
> > instances the memory range may be reused at which point a user would expect
> > guards to still be in place), but they are cleared via MADV_GUARD_REMOVE,
> > process teardown or unmapping of memory ranges.
> >
> > The guard property can be removed from ranges via MADV_GUARD_REMOVE. The
> > ranges over which this is applied, should they contain non-guard entries,
> > will be untouched, with only guard entries being cleared.
> >
> > We permit this operation on anonymous memory only, and only VMAs which are
> > non-special, non-huge and not mlock()'d (if we permitted this we'd have to
> > drop locked pages which would be rather counterintuitive).
> >
> > Racing page faults can cause repeated attempts to install guard pages that
> > are interrupted, resulting in a zap, and this process can end up being
> > repeated. If this happens more than would be expected in normal operation,
> > we rescind locks and retry the whole thing, which avoids lock contention in
> > this scenario.
> >
> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > Suggested-by: Jann Horn <jannh@google.com>
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>

Thanks!

> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -423,6 +423,12 @@ extern unsigned long highest_memmap_pfn;
> >   */
> >  #define MAX_RECLAIM_RETRIES 16
> >
> > +/*
> > + * Maximum number of attempts we make to install guard pages before we give up
> > + * and return -ERESTARTNOINTR to have userspace try again.
> > + */
> > +#define MAX_MADVISE_GUARD_RETRIES 3
>
> Can't we simply put this in mm/madvise.c ? Didn't find usage elsewhere.
>
>

Sure, will move if respin/can send a quick fixpatch next week if otherwise
settled. Just felt vaguely 'neater' here for... spurious subjective squishy
brained reasons :)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism
  2024-10-25 17:12   ` Lorenzo Stoakes
@ 2024-10-25 21:56     ` Vlastimil Babka
  2024-10-25 22:35       ` Lorenzo Stoakes
  0 siblings, 1 reply; 29+ messages in thread
From: Vlastimil Babka @ 2024-10-25 21:56 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Paul E . McKenney, Jann Horn, David Hildenbrand, linux-mm,
	linux-kernel, Muchun Song, Richard Henderson, Matt Turner,
	Thomas Bogendoerfer, James E . J . Bottomley, Helge Deller,
	Chris Zankel, Max Filippov, Arnd Bergmann, linux-alpha,
	linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On 10/25/24 19:12, Lorenzo Stoakes wrote:
> On Wed, Oct 23, 2024 at 05:24:40PM +0100, Lorenzo Stoakes wrote:
>> Implement a new lightweight guard page feature, that is regions of userland
>> virtual memory that, when accessed, cause a fatal signal to arise.
> 
> <snip>
> 
> Hi Andrew - Could you apply the below fix-patch? I realise we must handle
> fatal signals and conditional rescheduling in the vector_madvise() special
> case.
> 
> Thanks!
> 
> ----8<----
> From 546d7e1831c71599fc733d589e0d75f52e84826d Mon Sep 17 00:00:00 2001
> From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Date: Fri, 25 Oct 2024 18:05:48 +0100
> Subject: [PATCH] mm: yield on fatal signal/cond_resched() in vector_madvise()
> 
> While we have to treat -ERESTARTNOINTR specially here as we are looping
> through a vector of operations and can't simply restart the entire
> operation, we mustn't hold up fatal signals or RT kernels.

> For the plain madvise() syscall, returning -ERESTARTNOINTR does the right thing
and checks fatal_signal_pending() before returning, right?

Uh actually can we be just returning -ERESTARTNOINTR or do we need to use
restart_syscall()?

> ---
>  mm/madvise.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 48eba25e25fe..127aa5d86656 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1713,8 +1713,14 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  		 * we have already rescinded locks, it should be no problem to
>  		 * simply try again.
>  		 */
> -		if (ret == -ERESTARTNOINTR)
> +		if (ret == -ERESTARTNOINTR) {
> +			if (fatal_signal_pending(current)) {
> +				ret = -EINTR;
> +				break;
> +			}
> +			cond_resched();

Should be unnecessary as we're calling an operation that takes a rwsem so
there are reschedule points already. And with lazy preempt hopefully
cond_resched()s will become history, so let's not add more only to delete later.

>  			continue;
> +		}
>  		if (ret < 0)
>  			break;
>  		iov_iter_advance(iter, iter_iov_len(iter));
> --
> 2.47.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs
  2024-10-25 18:13   ` Vlastimil Babka
@ 2024-10-25 21:58     ` Lorenzo Stoakes
  0 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-25 21:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
	Matthew Wilcox, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Fri, Oct 25, 2024 at 08:13:26PM +0200, Vlastimil Babka wrote:
> On 10/23/24 18:24, Lorenzo Stoakes wrote:
> > The existing generic pagewalk logic permits the walking of page tables,
> > invoking callbacks at individual page table levels via user-provided
> > mm_walk_ops callbacks.
> >
> > This is useful for traversing existing page table entries, but precludes
> > the ability to establish new ones.
> >
> > Existing mechanisms for performing a walk which also installs page table
> > entries if necessary are heavily duplicated throughout the kernel, each
> > with semantic differences from one another and largely unavailable for use
> > elsewhere.
> >
> > Rather than add yet another implementation, we extend the generic pagewalk
> > logic to enable the installation of page table entries by adding a new
> > install_pte() callback in mm_walk_ops. If this is specified, then upon
> > encountering a missing page table entry, we allocate and install a new one
> > and continue the traversal.
> >
> > If a THP huge page is encountered at either the PMD or PUD level we split
> > it only if there is an ops->pte_entry() (or ops->pmd_entry at PUD level),
> > otherwise if there is only an ops->install_pte(), we avoid the unnecessary
> > split.
> >
> > We do not support hugetlb at this stage.
> >
> > If this function returns an error, or an allocation fails during the
> > operation, we abort the operation altogether. It is up to the caller to
> > deal appropriately with partially populated page table ranges.
> >
> > If install_pte() is defined, the semantics of pte_entry() change - this
> > callback is then only invoked if the entry already exists. This is a useful
> > property, as it allows a caller to handle existing PTEs while installing
> > new ones where necessary in the specified range.
> >
> > If install_pte() is not defined, then there is no functional difference to
> > this patch, so all existing logic will work precisely as it did before.
> >
> > As we only permit the installation of PTEs where a mapping does not already
> > exist there is no need for TLB management, however we do invoke
> > update_mmu_cache() for architectures which require manual maintenance of
> > mappings for other CPUs.
> >
> > We explicitly do not allow the existing page walk API to expose this
> > feature as it is dangerous and intended for internal mm use only. Therefore
> > we provide a new walk_page_range_mm() function exposed only to
> > mm/internal.h.
> >
> > Reviewed-by: Jann Horn <jannh@google.com>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

>
> Just a small subjective suggestion in case you agree and there's a respin or
> followups:
>
> > @@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >
> >  		if (walk->action == ACTION_AGAIN)
> >  			goto again;
> > -
> > -		/*
> > -		 * Check this here so we only break down trans_huge
> > -		 * pages when we _need_ to
> > -		 */
> > -		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> > -		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pte_entry))
> > +		if (walk->action == ACTION_CONTINUE)
> >  			continue;
> > +		if (!ops->install_pte && !ops->pte_entry)
> > +			continue; /* Nothing to do. */
> > +		if (!ops->pte_entry && ops->install_pte &&
> > +		    pmd_present(*pmd) &&
> > +		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> > +			continue; /* Avoid unnecessary split. */
>
> Much better now, thanks, but maybe the last 2 parts could be:
>
> if (!ops->pte_entry) {
> 	if (!ops->install_pte)
> 		continue; /* Nothing to do. */
> 	else if (pmd_present(*pmd)
> 		 && (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> 		continue; /* Avoid unnecessary split. */
> }

I quite liked separating out the 'nothing to do' vs. the 'unnecessary split'
cases, but I agree it makes it harder to see that the 2nd case is an 'install
pte ONLY' case.

Yeah so I think your version is better, but maybe we can find a way to be more
expressive somehow... if we could declare vars mid-way through it'd be easier
:P

Will improve on respin if it comes up

>
> Or at least put !ops->pte_entry first in both conditions?

Ack yeah that'd be better!

>
> >  		if (walk->vma)
> >  			split_huge_pmd(walk->vma, pmd, addr);
> > +		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> > +			continue; /* Nothing to do. */
> >
> >  		err = walk_pte_range(pmd, addr, next, walk);
> >  		if (err)
> > @@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >   again:
> >  		next = pud_addr_end(addr, end);
> >  		if (pud_none(*pud)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pmd_alloc(walk->mm, pud, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >
> >  		walk->action = ACTION_SUBTREE;
> > @@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >
> >  		if (walk->action == ACTION_AGAIN)
> >  			goto again;
> > -
> > -		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> > -		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pmd_entry || ops->pte_entry))
> > +		if (walk->action == ACTION_CONTINUE)
> >  			continue;
> > +		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
> > +			continue;  /* Nothing to do. */
> > +		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
> > +		    pud_present(*pud) &&
> > +		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
> > +			continue; /* Avoid unnecessary split. */
>
> Ditto.

Ack!

>
> Thanks!
>

Cheers!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism
  2024-10-25 21:56     ` Vlastimil Babka
@ 2024-10-25 22:35       ` Lorenzo Stoakes
  2024-10-28 12:40         ` Lorenzo Stoakes
  0 siblings, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-25 22:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
	Matthew Wilcox, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Fri, Oct 25, 2024 at 11:56:52PM +0200, Vlastimil Babka wrote:
> On 10/25/24 19:12, Lorenzo Stoakes wrote:
> > On Wed, Oct 23, 2024 at 05:24:40PM +0100, Lorenzo Stoakes wrote:
> >> Implement a new lightweight guard page feature, that is regions of userland
> >> virtual memory that, when accessed, cause a fatal signal to arise.
> >
> > <snip>
> >
> > Hi Andrew - Could you apply the below fix-patch? I realise we must handle
> > fatal signals and conditional rescheduling in the vector_madvise() special
> > case.
> >
> > Thanks!
> >
> > ----8<----
> > From 546d7e1831c71599fc733d589e0d75f52e84826d Mon Sep 17 00:00:00 2001
> > From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Date: Fri, 25 Oct 2024 18:05:48 +0100
> > Subject: [PATCH] mm: yield on fatal signal/cond_resched() in vector_madvise()
> >
> > While we have to treat -ERESTARTNOINTR specially here as we are looping
> > through a vector of operations and can't simply restart the entire
> > operation, we mustn't hold up fatal signals or RT kernels.
>
> > For the plain madvise() syscall, returning -ERESTARTNOINTR does the right thing
> and checks fatal_signal_pending() before returning, right?

I believe so. But now you've caused me some doubt so let me double check
and make absolutely sure :)

>
> Uh actually can we be just returning -ERESTARTNOINTR or do we need to use
> restart_syscall()?

Yeah I was wondering about that, but restart_syscall() seems to set
TIF_SIGPENDING, and I wondered if that was correct... but then I saw other
places that seemed to use it directly, so it seemed so.

Let's eliminate doubt; will check this next week and make sure.

>
> > ---
> >  mm/madvise.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 48eba25e25fe..127aa5d86656 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1713,8 +1713,14 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >  		 * we have already rescinded locks, it should be no problem to
> >  		 * simply try again.
> >  		 */
> > -		if (ret == -ERESTARTNOINTR)
> > +		if (ret == -ERESTARTNOINTR) {
> > +			if (fatal_signal_pending(current)) {
> > +				ret = -EINTR;
> > +				break;
> > +			}
> > +			cond_resched();
>
> Should be unnecessary as we're calling an operation that takes a rwsem so
> there are reschedule points already. And with lazy preempt hopefully
> cond_resched()s will become history, so let's not add more only to delete later.

Ack will remove on respin.

>
> >  			continue;
> > +		}
> >  		if (ret < 0)
> >  			break;
> >  		iov_iter_advance(iter, iter_iov_len(iter));
> > --
> > 2.47.0
>

For simplicity, and with your other comments too, I think I'll respin this next
week.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism
  2024-10-24  7:25     ` Lorenzo Stoakes
@ 2024-10-26  0:11       ` Andrew Morton
  2024-10-26  7:40         ` Lorenzo Stoakes
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2024-10-26  0:11 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Thu, 24 Oct 2024 08:25:46 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> I actually do plan to extend this work to support shmem and file-backed
> mappings in the future as a revision to this work.

Useful, thanks.  I pasted this in.

> >
> > (generally, it would be nice to include the proposed manpage update at
> > this time, so people can review it while the code change is fresh in
> > their minds)
> 
> It'd be nice to have the man pages live somewhere within the kernel so we
> can do this as part of the patch change as things evolve during review, but
> obviously moving things about like that is out of scope for this discussion
> :)

Yes, that would be good.  At present the linkage is so poor that things
could get lost.

I guess one thing we could do is to include the proposed manpage update
within the changelogs.  That way it's stored somewhere and gets reviewed
alongside the patches themselves.

> I do explicitly intend to send a manpage update once this series lands
> however.

That's late, IMO.  Sometimes reviewing manpage updates leads people to
ask "hey.  what about X" or "hey, that's wrong".  Michael Kerrisk was
good at finding such holes, back in the day.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism
  2024-10-26  0:11       ` Andrew Morton
@ 2024-10-26  7:40         ` Lorenzo Stoakes
  0 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-26  7:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Fri, Oct 25, 2024 at 05:11:31PM -0700, Andrew Morton wrote:
> On Thu, 24 Oct 2024 08:25:46 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > I actually do plan to extend this work to support shmem and file-backed
> > mappings in the future as a revision to this work.
>
> Useful, thanks.  I pasted this in.
>
> > >
> > > (generally, it would be nice to include the proposed manpage update at
> > > this time, so people can review it while the code change is fresh in
> > > their minds)
> >
> > It'd be nice to have the man pages live somewhere within the kernel so we
> > can do this as part of the patch change as things evolve during review, but
> > obviously moving things about like that is out of scope for this discussion
> > :)
>
> Yes, that would be good.  At present the linkage is so poor that things
> could get lost.

Things _have_ got lost. I don't see MADV_DONTNEED_LOCKED in the madvise()
man page for instance...

(I intend actually to send a patch for that alongside my changes.)

>
> I guess one thing we could do is to include the proposed manpage update
> within the changelogs.  That way it's stored somewhere and gets reviewed
> alongside the patches themselves.

Hm it won't be a diff but I do like the idea... let me come up with
something.

>
> > I do explicitly intend to send a manpage update once this series lands
> > however.
>
> That's late, IMO.  Sometimes reviewing manpage updates leads people to
> ask "hey.  what about X" or "hey, that's wrong".  Michael Kerrisk was
> good at finding such holes, back in the day.
>

Right, this is the problem with having the manpages in a separate repo. It
seems presumptuous to _assume_ this will land in 6.13, though I hope it
does, but if I get a patch accepted in the manpages they may ship a version
that has information that is simply invalid...

Really would be nice to integrate it here :)

Michael Kerrisk is a hero, writes with a power of clarity I could only
dream of :) understandable that he may be rather tired of having to
maintain all this however...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs
  2024-10-25 19:08           ` David Hildenbrand
@ 2024-10-26  7:42             ` Lorenzo Stoakes
  0 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-26  7:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
	Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Fri, Oct 25, 2024 at 09:08:24PM +0200, David Hildenbrand wrote:
> > >
> > > We already discussed in the past that we need a better and more efficient
> > > way to walk page tables. I have part of that on my TODO list, but I'm
> > > getting distracted.
> >
> > Yes I remember an LSF session on this, it's a really obvious area of improvement
> > that stands out at the moment for sure.
> >
> > Having worked several 12+ hour days in a row recently I can relate to
> > workload making this difficult though :)
>
> Yes :)
>
> >
> > >
> > > *Inserting* (not walking/modifying existing things as most users do) as done
> > > in this patch is slightly different though, likely "one thing that fits all"
> > > will not apply to all page table walker use cases.
> >
> > Yeah, there's also replace scenarios which then have to do egregious amounts of
> > work to make sure we do everything right, in fact there's duplicates of this in
> > mm/madvise.c *grumble grumble*.
> >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
> >
> > OK so I guess I'll hold off my TODOs on this as you are looking in this area and
> > I trust you :)
>
> It will probably take me a while until I get to it, though. I'd focus on
> walking (and batching what we can) first, then on top modifying existing
> entries.

Yeah entirely understandable and that's the right order I think, the
modifying path is the trickier one especially with all the special cases
all over the place...

>
> The "install something where there is nothing yet" (incl. populating fresh
> page tables etc.) case probably deserves a separate "walker".

Yes this would avoid all the heavy handling a replace handler needs.

>
> If you end up having spare cycles and want to sync on a possible design for
> some part of that bigger task -- removing the old pagewalk logic -- please
> do reach out! :)

Thanks, I may have a play when I have a brief moment, as I feel quite
strongly we need to attack this (as do you I feel! :) will send some RFC or
some thoughts or whatever should I do so!

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism
  2024-10-25 22:35       ` Lorenzo Stoakes
@ 2024-10-28 12:40         ` Lorenzo Stoakes
  0 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-28 12:40 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
	Matthew Wilcox, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Fri, Oct 25, 2024 at 11:35:03PM +0100, Lorenzo Stoakes wrote:
> On Fri, Oct 25, 2024 at 11:56:52PM +0200, Vlastimil Babka wrote:
> > On 10/25/24 19:12, Lorenzo Stoakes wrote:
> > > On Wed, Oct 23, 2024 at 05:24:40PM +0100, Lorenzo Stoakes wrote:
> > >> Implement a new lightweight guard page feature, that is regions of userland
> > >> virtual memory that, when accessed, cause a fatal signal to arise.
> > >
> > > <snip>
> > >
> > > Hi Andrew - Could you apply the below fix-patch? I realise we must handle
> > > fatal signals and conditional rescheduling in the vector_madvise() special
> > > case.
> > >
> > > Thanks!
> > >
> > > ----8<----
> > > From 546d7e1831c71599fc733d589e0d75f52e84826d Mon Sep 17 00:00:00 2001
> > > From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Date: Fri, 25 Oct 2024 18:05:48 +0100
> > > Subject: [PATCH] mm: yield on fatal signal/cond_resched() in vector_madvise()
> > >
> > > While we have to treat -ERESTARTNOINTR specially here as we are looping
> > > through a vector of operations and can't simply restart the entire
> > > operation, we mustn't hold up fatal signals or RT kernels.
> >
> > For the plain madvise() syscall, returning -ERESTARTNOINTR does the right thing
> > and checks fatal_signal_pending() before returning, right?
>
> I believe so. But now you've caused me some doubt so let me double check
> and make absolutely sure :)
>
> >
> > Uh actually can we be just returning -ERESTARTNOINTR or do we need to use
> > restart_syscall()?
>
> Yeah I was wondering about that, but restart_syscall() seems to set
> TIF_SIGPENDING, and I wondered if that was correct... but then I saw other
> places that seemed to use it directly, so it seemed so.
>
> Let's eliminate doubt; will check this next week and make sure.
>

Yeah looks like we do actually, as the function is handled by
arch_do_signal_or_restart():

do_syscall_64()
  -> sycall_exit_to_user_mode_work()
    -> __sycall_exit_to_user_mode_work()
      -> exit_to_user_mode_prepare()
        -> exit_to_user_mode_loop()
	  -> arch_do_signal_or_restart()
	    -> (possibly) get_signal()

And arch_do_signal_or_restart() is only invoked by exit_to_user_mode_loop()
if _TIF_SIGPENDING or _TIF_NOTIFY_SIGNAL is set:

		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
			arch_do_signal_or_restart(regs);

This is set by restart_syscall():

static inline int restart_syscall(void)
{
	set_tsk_thread_flag(current, TIF_SIGPENDING);
	return -ERESTARTNOINTR;
}

It's also effectively a nop if no signal is actually pending, and it handily
deals with signals too...
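
So for the plain madvise() path the respin will presumably do something along
these lines at the point we give up (a rough sketch only - the 'retries' local
is made up here, while MAX_MADVISE_GUARD_RETRIES and restart_syscall() are
real):

	if (++retries > MAX_MADVISE_GUARD_RETRIES) {
		/*
		 * restart_syscall() sets TIF_SIGPENDING and returns
		 * -ERESTARTNOINTR, so exit_to_user_mode_loop() takes the
		 * arch_do_signal_or_restart() path and either restarts the
		 * syscall transparently or delivers a pending fatal signal.
		 */
		return restart_syscall();
	}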

> >
> > > ---
> > >  mm/madvise.c | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 48eba25e25fe..127aa5d86656 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -1713,8 +1713,14 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> > >  		 * we have already rescinded locks, it should be no problem to
> > >  		 * simply try again.
> > >  		 */
> > > -		if (ret == -ERESTARTNOINTR)
> > > +		if (ret == -ERESTARTNOINTR) {
> > > +			if (fatal_signal_pending(current)) {
> > > +				ret = -EINTR;
> > > +				break;
> > > +			}
> > > +			cond_resched();
> >
> > Should be unnecessary as we're calling an operation that takes a rwsem so
> > there are reschedule points already. And with lazy preempt hopefully
> > cond_resched()s will become history, so let's not add more only to delete later.
>
> Ack will remove on respin.
>
> >
> > >  			continue;
> > > +		}
> > >  		if (ret < 0)
> > >  			break;
> > >  		iov_iter_advance(iter, iter_iov_len(iter));
> > > --
> > > 2.47.0
> >
>
> For simplicity, and with your other comments too, I think I'll respin this next
> week.

So will respin to use restart_syscall() correctly (+ fix up your other points too).

Have tested and confirmed that failing to use restart_syscall() causes
-ERESTARTNOINTR to be returned with no restart, but using it works.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs
  2024-10-23 16:24 ` [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs Lorenzo Stoakes
  2024-10-23 23:04   ` Andrew Morton
  2024-10-25 18:13   ` Vlastimil Babka
@ 2024-10-28 20:29   ` Jarkko Sakkinen
  2024-10-28 21:49     ` Lorenzo Stoakes
  2 siblings, 1 reply; 29+ messages in thread
From: Jarkko Sakkinen @ 2024-10-28 20:29 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Wed Oct 23, 2024 at 7:24 PM EEST, Lorenzo Stoakes wrote:
> The existing generic pagewalk logic permits the walking of page tables,
> invoking callbacks at individual page table levels via user-provided
> mm_walk_ops callbacks.
>
> This is useful for traversing existing page table entries, but precludes
> the ability to establish new ones.
>
> Existing mechanisms for performing a walk which also installs page table
> entries if necessary are heavily duplicated throughout the kernel, each
> with semantic differences from one another and largely unavailable for use
> elsewhere.
>
> Rather than add yet another implementation, we extend the generic pagewalk
> logic to enable the installation of page table entries by adding a new
> install_pte() callback in mm_walk_ops. If this is specified, then upon
> encountering a missing page table entry, we allocate and install a new one
> and continue the traversal.
>
> If a THP huge page is encountered at either the PMD or PUD level we split
> it only if there is an ops->pte_entry() (or ops->pmd_entry at PUD level),
> otherwise if there is only an ops->install_pte(), we avoid the unnecessary
> split.

Just for interest: does this mean that the operation is always
"destructive" (i.e. modifying state) even when install_pte() does not
do anything, i.e. does the split always happen regardless of what the callback
does? Not really an expert on this, but this paragraph won't leave me
alone :-)

>
> We do not support hugetlb at this stage.
>
> If this function returns an error, or an allocation fails during the
> operation, we abort the operation altogether. It is up to the caller to
> deal appropriately with partially populated page table ranges.
>
> If install_pte() is defined, the semantics of pte_entry() change - this
> callback is then only invoked if the entry already exists. This is a useful
> property, as it allows a caller to handle existing PTEs while installing
> new ones where necessary in the specified range.
>
> If install_pte() is not defined, then there is no functional difference to
> this patch, so all existing logic will work precisely as it did before.
>
> As we only permit the installation of PTEs where a mapping does not already
> exist there is no need for TLB management, however we do invoke
> update_mmu_cache() for architectures which require manual maintenance of
> mappings for other CPUs.
>
> We explicitly do not allow the existing page walk API to expose this
> feature as it is dangerous and intended for internal mm use only. Therefore
> we provide a new walk_page_range_mm() function exposed only to
> mm/internal.h.
>
> Reviewed-by: Jann Horn <jannh@google.com>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  include/linux/pagewalk.h |  18 +++-
>  mm/internal.h            |   6 ++
>  mm/pagewalk.c            | 227 +++++++++++++++++++++++++++------------
>  3 files changed, 182 insertions(+), 69 deletions(-)
>
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index f5eb5a32aeed..9700a29f8afb 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -25,12 +25,15 @@ enum page_walk_lock {
>   *			this handler is required to be able to handle
>   *			pmd_trans_huge() pmds.  They may simply choose to
>   *			split_huge_page() instead of handling it explicitly.
> - * @pte_entry:		if set, called for each PTE (lowest-level) entry,
> - *			including empty ones
> + * @pte_entry:		if set, called for each PTE (lowest-level) entry
> + *			including empty ones, except if @install_pte is set.
> + *			If @install_pte is set, @pte_entry is called only for
> + *			existing PTEs.
>   * @pte_hole:		if set, called for each hole at all levels,
>   *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
>   *			Any folded depths (where PTRS_PER_P?D is equal to 1)
> - *			are skipped.
> + *			are skipped. If @install_pte is specified, this will
> + *			not trigger for any populated ranges.
>   * @hugetlb_entry:	if set, called for each hugetlb entry. This hook
>   *			function is called with the vma lock held, in order to
>   *			protect against a concurrent freeing of the pte_t* or
> @@ -51,6 +54,13 @@ enum page_walk_lock {
>   * @pre_vma:            if set, called before starting walk on a non-null vma.
>   * @post_vma:           if set, called after a walk on a non-null vma, provided
>   *                      that @pre_vma and the vma walk succeeded.
> + * @install_pte:        if set, missing page table entries are installed and
> + *                      thus all levels are always walked in the specified
> + *                      range. This callback is then invoked at the PTE level
> + *                      (having split any THP pages prior), providing the PTE to
> + *                      install. If allocations fail, the walk is aborted. This
> + *                      operation is only available for userland memory. Not
> + *                      usable for hugetlb ranges.

Given that walk_page_range_novma() especially has a bunch of call sites,
it would not hurt to simply mention here that this is only for mm-internal
use, without much other explanation.

>   *
>   * p?d_entry callbacks are called even if those levels are folded on a
>   * particular architecture/configuration.
> @@ -76,6 +86,8 @@ struct mm_walk_ops {
>  	int (*pre_vma)(unsigned long start, unsigned long end,
>  		       struct mm_walk *walk);
>  	void (*post_vma)(struct mm_walk *walk);
> +	int (*install_pte)(unsigned long addr, unsigned long next,
> +			   pte_t *ptep, struct mm_walk *walk);
>  	enum page_walk_lock walk_lock;
>  };
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index 508f7802dd2b..fb1fb0c984e4 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -12,6 +12,7 @@
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
>  #include <linux/pagemap.h>
> +#include <linux/pagewalk.h>
>  #include <linux/rmap.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> @@ -1451,4 +1452,9 @@ static inline void accept_page(struct page *page)
>  }
>  #endif /* CONFIG_UNACCEPTED_MEMORY */
>  
> +/* pagewalk.c */
> +int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> +		unsigned long end, const struct mm_walk_ops *ops,
> +		void *private);
> +
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 5f9f01532e67..f3cbad384344 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -3,9 +3,14 @@
>  #include <linux/highmem.h>
>  #include <linux/sched.h>
>  #include <linux/hugetlb.h>
> +#include <linux/mmu_context.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>  
> +#include <asm/tlbflush.h>
> +
> +#include "internal.h"
> +
>  /*
>   * We want to know the real level where a entry is located ignoring any
>   * folding of levels which may be happening. For example if p4d is folded then
> @@ -29,9 +34,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
>  	int err = 0;
>  
>  	for (;;) {
> -		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> -		if (err)
> -		       break;
> +		if (ops->install_pte && pte_none(ptep_get(pte))) {
> +			pte_t new_pte;
> +
> +			err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte,
> +					       walk);
> +			if (err)
> +				break;
> +
> +			set_pte_at(walk->mm, addr, pte, new_pte);
> +			/* Non-present before, so for arches that need it. */
> +			if (!WARN_ON_ONCE(walk->no_vma))
> +				update_mmu_cache(walk->vma, addr, pte);
> +		} else {
> +			err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> +			if (err)
> +				break;
> +		}
>  		if (addr >= end - PAGE_SIZE)
>  			break;
>  		addr += PAGE_SIZE;
> @@ -89,11 +108,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  again:
>  		next = pmd_addr_end(addr, end);
>  		if (pmd_none(*pmd)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __pte_alloc(walk->mm, pmd);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, depth, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>  
>  		walk->action = ACTION_SUBTREE;
> @@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  
>  		if (walk->action == ACTION_AGAIN)
>  			goto again;
> -
> -		/*
> -		 * Check this here so we only break down trans_huge
> -		 * pages when we _need_ to
> -		 */
> -		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> -		    walk->action == ACTION_CONTINUE ||
> -		    !(ops->pte_entry))
> +		if (walk->action == ACTION_CONTINUE)
>  			continue;
> +		if (!ops->install_pte && !ops->pte_entry)
> +			continue; /* Nothing to do. */
> +		if (!ops->pte_entry && ops->install_pte &&
> +		    pmd_present(*pmd) &&
> +		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> +			continue; /* Avoid unnecessary split. */
>  
>  		if (walk->vma)
>  			split_huge_pmd(walk->vma, pmd, addr);
> +		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> +			continue; /* Nothing to do. */
>  
>  		err = walk_pte_range(pmd, addr, next, walk);
>  		if (err)
> @@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>   again:
>  		next = pud_addr_end(addr, end);
>  		if (pud_none(*pud)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __pmd_alloc(walk->mm, pud, addr);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, depth, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>  
>  		walk->action = ACTION_SUBTREE;
> @@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>  
>  		if (walk->action == ACTION_AGAIN)
>  			goto again;
> -
> -		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> -		    walk->action == ACTION_CONTINUE ||
> -		    !(ops->pmd_entry || ops->pte_entry))
> +		if (walk->action == ACTION_CONTINUE)
>  			continue;
> +		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
> +			continue;  /* Nothing to do. */
> +		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
> +		    pud_present(*pud) &&
> +		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
> +			continue; /* Avoid unnecessary split. */
>  
>  		if (walk->vma)
>  			split_huge_pud(walk->vma, pud, addr);
> +		else if (pud_leaf(*pud) || !pud_present(*pud))
> +			continue; /* Nothing to do. */
> +
>  		if (pud_none(*pud))
>  			goto again;
>  
> @@ -196,18 +228,22 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
>  	do {
>  		next = p4d_addr_end(addr, end);
>  		if (p4d_none_or_clear_bad(p4d)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __pud_alloc(walk->mm, p4d, addr);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, depth, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>  		if (ops->p4d_entry) {
>  			err = ops->p4d_entry(p4d, addr, next, walk);
>  			if (err)
>  				break;
>  		}
> -		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
> +		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry ||
> +		    ops->install_pte)
>  			err = walk_pud_range(p4d, addr, next, walk);
>  		if (err)
>  			break;
> @@ -231,18 +267,22 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
>  	do {
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none_or_clear_bad(pgd)) {
> -			if (ops->pte_hole)
> +			if (ops->install_pte)
> +				err = __p4d_alloc(walk->mm, pgd, addr);
> +			else if (ops->pte_hole)
>  				err = ops->pte_hole(addr, next, 0, walk);
>  			if (err)
>  				break;
> -			continue;
> +			if (!ops->install_pte)
> +				continue;
>  		}
>  		if (ops->pgd_entry) {
>  			err = ops->pgd_entry(pgd, addr, next, walk);
>  			if (err)
>  				break;
>  		}
> -		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry)
> +		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry ||
> +		    ops->pte_entry || ops->install_pte)
>  			err = walk_p4d_range(pgd, addr, next, walk);
>  		if (err)
>  			break;
> @@ -334,6 +374,11 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>  	int err = 0;
>  	struct vm_area_struct *vma = walk->vma;
>  	const struct mm_walk_ops *ops = walk->ops;
> +	bool is_hugetlb = is_vm_hugetlb_page(vma);
> +
> +	/* We do not support hugetlb PTE installation. */
> +	if (ops->install_pte && is_hugetlb)
> +		return -EINVAL;
>  
>  	if (ops->pre_vma) {
>  		err = ops->pre_vma(start, end, walk);
> @@ -341,7 +386,7 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>  			return err;
>  	}
>  
> -	if (is_vm_hugetlb_page(vma)) {
> +	if (is_hugetlb) {
>  		if (ops->hugetlb_entry)
>  			err = walk_hugetlb_range(start, end, walk);
>  	} else
> @@ -380,47 +425,14 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>  #endif
>  }
>  
> -/**
> - * walk_page_range - walk page table with caller specific callbacks
> - * @mm:		mm_struct representing the target process of page table walk
> - * @start:	start address of the virtual address range
> - * @end:	end address of the virtual address range
> - * @ops:	operation to call during the walk
> - * @private:	private data for callbacks' usage
> - *
> - * Recursively walk the page table tree of the process represented by @mm
> - * within the virtual address range [@start, @end). During walking, we can do
> - * some caller-specific works for each entry, by setting up pmd_entry(),
> - * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
> - * callbacks, the associated entries/pages are just ignored.
> - * The return values of these callbacks are commonly defined like below:
> - *
> - *  - 0  : succeeded to handle the current entry, and if you don't reach the
> - *         end address yet, continue to walk.
> - *  - >0 : succeeded to handle the current entry, and return to the caller
> - *         with caller specific value.
> - *  - <0 : failed to handle the current entry, and return to the caller
> - *         with error code.
> - *
> - * Before starting to walk page table, some callers want to check whether
> - * they really want to walk over the current vma, typically by checking
> - * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
> - * purpose.
> - *
> - * If operations need to be staged before and committed after a vma is walked,
> - * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
> - * since it is intended to handle commit-type operations, can't return any
> - * errors.
> - *
> - * struct mm_walk keeps current values of some common data like vma and pmd,
> - * which are useful for the access from callbacks. If you want to pass some
> - * caller-specific data to callbacks, @private should be helpful.
> +/*
> + * See the comment for walk_page_range(), this performs the heavy lifting of the
> + * operation, only sets no restrictions on how the walk proceeds.
>   *
> - * Locking:
> - *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
> - *   because these function traverse vma list and/or access to vma's data.
> + * We usually restrict the ability to install PTEs, but this functionality is
> + * available to internal memory management code and provided in mm/internal.h.
>   */
> -int walk_page_range(struct mm_struct *mm, unsigned long start,
> +int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
>  		unsigned long end, const struct mm_walk_ops *ops,
>  		void *private)
>  {
> @@ -479,6 +491,80 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>  	return err;
>  }
>  
> +/*
> + * Determine if the walk operations specified are permitted to be used for a
> + * page table walk.
> + *
> + * This check is performed on all functions which are parameterised by walk
> + * operations and exposed in include/linux/pagewalk.h.
> + *
> + * Internal memory management code can use the walk_page_range_mm() function to
> + * be able to use all page walking operations.
> + */
> +static bool check_ops_valid(const struct mm_walk_ops *ops)
> +{
> +	/*
> +	 * The installation of PTEs is solely under the control of memory
> +	 * management logic and subject to many subtle locking, security and
> +	 * cache considerations so we cannot permit other users to do so, and
> +	 * certainly not for exported symbols.
> +	 */
> +	if (ops->install_pte)
> +		return false;
> +
> +	return true;

or "return !!(ops->install_pte);"

> +}

Alternatively one could consider defining "struct mm_walk_internal_ops",
which would only be available in internal.h, but I guess there are good
reasons to do it the way it is.

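A very rough sketch of what that alternative might look like (purely
hypothetical - the names and layout below are illustrative, not part of
this series):

	/* mm/internal.h - hypothetical sketch, not what the patch implements. */
	struct mm_walk_internal_ops {
		struct mm_walk_ops ops;	/* the public callbacks */
		/* PTE installation only nameable by mm-internal callers. */
		int (*install_pte)(unsigned long addr, unsigned long next,
				   pte_t *ptep, struct mm_walk *walk);
	};

	int walk_page_range_internal(struct mm_struct *mm, unsigned long start,
				     unsigned long end,
				     const struct mm_walk_internal_ops *ops,
				     void *private);

That would make the restriction a type-level one rather than a runtime
check, at the cost of duplicating the walk entry points.
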
> +
> +/**
> + * walk_page_range - walk page table with caller specific callbacks
> + * @mm:		mm_struct representing the target process of page table walk
> + * @start:	start address of the virtual address range
> + * @end:	end address of the virtual address range
> + * @ops:	operation to call during the walk
> + * @private:	private data for callbacks' usage
> + *
> + * Recursively walk the page table tree of the process represented by @mm
> + * within the virtual address range [@start, @end). During walking, we can do
> + * some caller-specific works for each entry, by setting up pmd_entry(),
> + * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
> + * callbacks, the associated entries/pages are just ignored.
> + * The return values of these callbacks are commonly defined like below:
> + *
> + *  - 0  : succeeded to handle the current entry, and if you don't reach the
> + *         end address yet, continue to walk.
> + *  - >0 : succeeded to handle the current entry, and return to the caller
> + *         with caller specific value.
> + *  - <0 : failed to handle the current entry, and return to the caller
> + *         with error code.
> + *
> + * Before starting to walk page table, some callers want to check whether
> + * they really want to walk over the current vma, typically by checking
> + * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
> + * purpose.
> + *
> + * If operations need to be staged before and committed after a vma is walked,
> + * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
> + * since it is intended to handle commit-type operations, can't return any
> + * errors.
> + *
> + * struct mm_walk keeps current values of some common data like vma and pmd,
> + * which are useful for the access from callbacks. If you want to pass some
> + * caller-specific data to callbacks, @private should be helpful.
> + *
> + * Locking:
> + *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
> + *   because these function traverse vma list and/or access to vma's data.
> + */
> +int walk_page_range(struct mm_struct *mm, unsigned long start,
> +		unsigned long end, const struct mm_walk_ops *ops,
> +		void *private)
> +{
> +	if (!check_ops_valid(ops))
> +		return -EINVAL;
> +
> +	return walk_page_range_mm(mm, start, end, ops, private);
> +}
> +
>  /**
>   * walk_page_range_novma - walk a range of pagetables not backed by a vma
>   * @mm:		mm_struct representing the target process of page table walk
> @@ -494,7 +580,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>   * walking the kernel pages tables or page tables for firmware.
>   *
>   * Note: Be careful to walk the kernel pages tables, the caller may be need to
> - * take other effective approache (mmap lock may be insufficient) to prevent
> + * take other effective approaches (mmap lock may be insufficient) to prevent
>   * the intermediate kernel page tables belonging to the specified address range
>   * from being freed (e.g. memory hot-remove).
>   */
> @@ -513,6 +599,8 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>  
>  	if (start >= end || !walk.mm)
>  		return -EINVAL;
> +	if (!check_ops_valid(ops))
> +		return -EINVAL;
>  
>  	/*
>  	 * 1) For walking the user virtual address space:
> @@ -556,6 +644,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
>  		return -EINVAL;
>  	if (start < vma->vm_start || end > vma->vm_end)
>  		return -EINVAL;
> +	if (!check_ops_valid(ops))
> +		return -EINVAL;
>  
>  	process_mm_walk_lock(walk.mm, ops->walk_lock);
>  	process_vma_walk_lock(vma, ops->walk_lock);
> @@ -574,6 +664,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>  
>  	if (!walk.mm)
>  		return -EINVAL;
> +	if (!check_ops_valid(ops))
> +		return -EINVAL;
>  
>  	process_mm_walk_lock(walk.mm, ops->walk_lock);
>  	process_vma_walk_lock(vma, ops->walk_lock);
> @@ -623,6 +715,9 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
>  	unsigned long start_addr, end_addr;
>  	int err = 0;
>  
> +	if (!check_ops_valid(ops))
> +		return -EINVAL;
> +
>  	lockdep_assert_held(&mapping->i_mmap_rwsem);
>  	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index,
>  				  first_index + nr - 1) {

So I took a random patch set in order to learn how to compile a Fedora Ark
kernel out of any upstream tree (mm in this case), thus making noise
here.

With this goal, which is mainly to be able to do such a thing once or twice
per release cycle:

Tested-by: Jarkko Sakkinen <jarkko@kernel.org>

BR, Jarkko

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 5/5] selftests/mm: add self tests for guard page feature
  2024-10-23 16:24 ` [PATCH v3 5/5] selftests/mm: add self tests for guard page feature Lorenzo Stoakes
@ 2024-10-28 20:32   ` Jarkko Sakkinen
  0 siblings, 0 replies; 29+ messages in thread
From: Jarkko Sakkinen @ 2024-10-28 20:32 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Wed Oct 23, 2024 at 7:24 PM EEST, Lorenzo Stoakes wrote:
> Utilise the kselftest harness to implement tests for the guard page
> implementation.
>
> We start by implementing basic tests asserting that guard pages can be
> installed, removed and that touching guard pages results in SIGSEGV. We also
> assert that, in removing guard pages from a range, non-guard pages remain
> intact.
>
> We then examine how different operations on regions containing guard markers
> behave to ensure correct behaviour:
>
> * Operations over multiple VMAs operate as expected.
> * Invoking MADV_GUARD_INSTALL / MADV_GUARD_REMOVE via process_madvise() in
>   batches works correctly.
> * Ensuring that munmap() correctly tears down guard markers.
> * Using mprotect() to adjust protection bits does not in any way override
>   or cause issues with guard markers.
> * Ensuring that splitting and merging VMAs around guard markers causes no
>   issue - i.e. that a marker which 'belongs' to one VMA can function just
>   as well 'belonging' to another.
> * Ensuring that madvise(..., MADV_DONTNEED) and madvise(..., MADV_FREE)
>   do not remove guard markers.
> * Ensuring that mlock()'ing a range containing guard markers does not
>   cause issues.
> * Ensuring that mremap() can move a guard range and retain guard markers.
> * Ensuring that mremap() can expand a guard range and retain guard
>   markers (perhaps moving the range).
> * Ensuring that mremap() can shrink a guard range and retain guard markers.
> * Ensuring that forking a process correctly retains guard markers.
> * Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
> * Ensuring that lazyfree simply clears guard markers.
> * Ensuring that userfaultfd can co-exist with guard pages.
> * Ensuring that madvise(..., MADV_POPULATE_READ) and
>   madvise(..., MADV_POPULATE_WRITE) error out when encountering
>   guard markers.
> * Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do
>   not remove guard markers.
>
> If any test is unable to be run due to lack of permissions, that test is
> skipped.
>
> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  tools/testing/selftests/mm/.gitignore    |    1 +
>  tools/testing/selftests/mm/Makefile      |    1 +
>  tools/testing/selftests/mm/guard-pages.c | 1239 ++++++++++++++++++++++
>  3 files changed, 1241 insertions(+)
>  create mode 100644 tools/testing/selftests/mm/guard-pages.c
>
> diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
> index 689bbd520296..8f01f4da1c0d 100644
> --- a/tools/testing/selftests/mm/.gitignore
> +++ b/tools/testing/selftests/mm/.gitignore
> @@ -54,3 +54,4 @@ droppable
>  hugetlb_dio
>  pkey_sighandler_tests_32
>  pkey_sighandler_tests_64
> +guard-pages
> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> index 02e1204971b0..15c734d6cfec 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv
>  TEST_GEN_FILES += hugetlb_madv_vs_map
>  TEST_GEN_FILES += hugetlb_dio
>  TEST_GEN_FILES += droppable
> +TEST_GEN_FILES += guard-pages
>  
>  ifneq ($(ARCH),arm64)
>  TEST_GEN_FILES += soft-dirty
> diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c
> new file mode 100644
> index 000000000000..7db9c913e9db
> --- /dev/null
> +++ b/tools/testing/selftests/mm/guard-pages.c
> @@ -0,0 +1,1239 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#define _GNU_SOURCE
> +#include "../kselftest_harness.h"
> +#include <asm-generic/mman.h> /* Force the import of the tools version. */
> +#include <assert.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <linux/userfaultfd.h>
> +#include <setjmp.h>
> +#include <signal.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/syscall.h>
> +#include <sys/uio.h>
> +#include <unistd.h>
> +
> +/*
> + * Ignore the checkpatch warning, as per the C99 standard, section 7.14.1.1:
> + *
> + * "If the signal occurs other than as the result of calling the abort or raise
> + *  function, the behavior is undefined if the signal handler refers to any
> + *  object with static storage duration other than by assigning a value to an
> + *  object declared as volatile sig_atomic_t"
> + */
> +static volatile sig_atomic_t signal_jump_set;
> +static sigjmp_buf signal_jmp_buf;
> +
> +/*
> + * Ignore the checkpatch warning, we must read from x but don't want to do
> + * anything with it in order to trigger a read page fault. We therefore must use
> + * volatile to stop the compiler from optimising this away.
> + */
> +#define FORCE_READ(x) (*(volatile typeof(x) *)x)
> +
> +static int userfaultfd(int flags)
> +{
> +	return syscall(SYS_userfaultfd, flags);
> +}
> +
> +static void handle_fatal(int c)
> +{
> +	if (!signal_jump_set)
> +		return;
> +
> +	siglongjmp(signal_jmp_buf, c);
> +}
> +
> +static int pidfd_open(pid_t pid, unsigned int flags)
> +{
> +	return syscall(SYS_pidfd_open, pid, flags);
> +}
> +
> +/*
> + * Enable our signal catcher and try to read/write the specified buffer. The
> + * return value indicates whether the read/write succeeds without a fatal
> + * signal.
> + */
> +static bool try_access_buf(char *ptr, bool write)
> +{
> +	bool failed;
> +
> +	/* Tell signal handler to jump back here on fatal signal. */
> +	signal_jump_set = true;
> +	/* If a fatal signal arose, we will jump back here and failed is set. */
> +	failed = sigsetjmp(signal_jmp_buf, 0) != 0;
> +
> +	if (!failed) {
> +		if (write)
> +			*ptr = 'x';
> +		else
> +			FORCE_READ(ptr);
> +	}
> +
> +	signal_jump_set = false;
> +	return !failed;
> +}
> +
> +/* Try and read from a buffer, return true if no fatal signal. */
> +static bool try_read_buf(char *ptr)
> +{
> +	return try_access_buf(ptr, false);
> +}
> +
> +/* Try and write to a buffer, return true if no fatal signal. */
> +static bool try_write_buf(char *ptr)
> +{
> +	return try_access_buf(ptr, true);
> +}
> +
> +/*
> + * Try and BOTH read from AND write to a buffer, return true if BOTH operations
> + * succeed.
> + */
> +static bool try_read_write_buf(char *ptr)
> +{
> +	return try_read_buf(ptr) && try_write_buf(ptr);
> +}
> +
> +FIXTURE(guard_pages)
> +{
> +	unsigned long page_size;
> +};
> +
> +FIXTURE_SETUP(guard_pages)
> +{
> +	struct sigaction act = {
> +		.sa_handler = &handle_fatal,
> +		.sa_flags = SA_NODEFER,
> +	};
> +
> +	sigemptyset(&act.sa_mask);
> +	if (sigaction(SIGSEGV, &act, NULL))
> +		ksft_exit_fail_perror("sigaction");
> +
> +	self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
> +};
> +
> +FIXTURE_TEARDOWN(guard_pages)
> +{
> +	struct sigaction act = {
> +		.sa_handler = SIG_DFL,
> +		.sa_flags = SA_NODEFER,
> +	};
> +
> +	sigemptyset(&act.sa_mask);
> +	sigaction(SIGSEGV, &act, NULL);
> +}
> +
> +TEST_F(guard_pages, basic)
> +{
> +	const unsigned long NUM_PAGES = 10;
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	int i;
> +
> +	ptr = mmap(NULL, NUM_PAGES * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_PRIVATE | MAP_ANON, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Trivially assert we can touch the first page. */
> +	ASSERT_TRUE(try_read_write_buf(ptr));
> +
> +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
> +
> +	/* Establish that 1st page SIGSEGV's. */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +
> +	/* Ensure we can touch everything else.*/
> +	for (i = 1; i < NUM_PAGES; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_TRUE(try_read_write_buf(curr));
> +	}
> +
> +	/* Establish a guard page at the end of the mapping. */
> +	ASSERT_EQ(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size,
> +			  MADV_GUARD_INSTALL), 0);
> +
> +	/* Check that both guard pages result in SIGSEGV. */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size]));
> +
> +	/* Remove the first guard page. */
> +	ASSERT_FALSE(madvise(ptr, page_size, MADV_GUARD_REMOVE));
> +
> +	/* Make sure we can touch it. */
> +	ASSERT_TRUE(try_read_write_buf(ptr));
> +
> +	/* Remove the last guard page. */
> +	ASSERT_FALSE(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size,
> +			     MADV_GUARD_REMOVE));
> +
> +	/* Make sure we can touch it. */
> +	ASSERT_TRUE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size]));
> +
> +	/*
> +	 *  Test setting a _range_ of pages, namely the first 3. The first of
> +	 *  these will be faulted in, so this also tests that we can install
> +	 *  guard pages over backed pages.
> +	 */
> +	ASSERT_EQ(madvise(ptr, 3 * page_size, MADV_GUARD_INSTALL), 0);
> +
> +	/* Make sure they are all guard pages. */
> +	for (i = 0; i < 3; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +
> +	/* Make sure the rest are not. */
> +	for (i = 3; i < NUM_PAGES; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_TRUE(try_read_write_buf(curr));
> +	}
> +
> +	/* Remove guard pages. */
> +	ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_REMOVE), 0);
> +
> +	/* Now make sure we can touch everything. */
> +	for (i = 0; i < NUM_PAGES; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_TRUE(try_read_write_buf(curr));
> +	}
> +
> +	/*
> +	 * Now remove all guard pages, make sure we don't remove existing
> +	 * entries.
> +	 */
> +	ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_REMOVE), 0);
> +
> +	for (i = 0; i < NUM_PAGES * page_size; i += page_size) {
> +		char chr = ptr[i];
> +
> +		ASSERT_EQ(chr, 'x');
> +	}
> +
> +	ASSERT_EQ(munmap(ptr, NUM_PAGES * page_size), 0);
> +}
> +
> +/* Assert that operations applied across multiple VMAs work as expected. */
> +TEST_F(guard_pages, multi_vma)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr_region, *ptr, *ptr1, *ptr2, *ptr3;
> +	int i;
> +
> +	/* Reserve a 100 page region over which we can install VMAs. */
> +	ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
> +			  MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_region, MAP_FAILED);
> +
> +	/* Place a VMA of 10 pages size at the start of the region. */
> +	ptr1 = mmap(ptr_region, 10 * page_size, PROT_READ | PROT_WRITE,
> +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr1, MAP_FAILED);
> +
> +	/* Place a VMA of 5 pages size 50 pages into the region. */
> +	ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size,
> +		    PROT_READ | PROT_WRITE,
> +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr2, MAP_FAILED);
> +
> +	/* Place a VMA of 20 pages size at the end of the region. */
> +	ptr3 = mmap(&ptr_region[80 * page_size], 20 * page_size,
> +		    PROT_READ | PROT_WRITE,
> +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr3, MAP_FAILED);
> +
> +	/* Unmap gaps. */
> +	ASSERT_EQ(munmap(&ptr_region[10 * page_size], 40 * page_size), 0);
> +	ASSERT_EQ(munmap(&ptr_region[55 * page_size], 25 * page_size), 0);
> +
> +	/*
> +	 * We end up with VMAs like this:
> +	 *
> +	 * 0    10 .. 50   55 .. 80   100
> +	 * [---]      [---]      [---]
> +	 */
> +
> +	/*
> +	 * Now mark the whole range as guard pages and make sure all VMAs are as
> +	 * such.
> +	 */
> +
> +	/*
> +	 * madvise() is certifiable and lets you perform operations over gaps,
> +	 * everything works, but it indicates an error and errno is set to
> +	 * -ENOMEM. Also if anything runs out of memory it is set to
> +	 * -ENOMEM. You are meant to guess which is which.
> +	 */
> +	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_INSTALL), -1);
> +	ASSERT_EQ(errno, ENOMEM);
> +
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr1[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +
> +	for (i = 0; i < 5; i++) {
> +		char *curr = &ptr2[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +
> +	for (i = 0; i < 20; i++) {
> +		char *curr = &ptr3[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +
> +	/* Now remove guard pages over range and assert the opposite. */
> +
> +	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_REMOVE), -1);
> +	ASSERT_EQ(errno, ENOMEM);
> +
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr1[i * page_size];
> +
> +		ASSERT_TRUE(try_read_write_buf(curr));
> +	}
> +
> +	for (i = 0; i < 5; i++) {
> +		char *curr = &ptr2[i * page_size];
> +
> +		ASSERT_TRUE(try_read_write_buf(curr));
> +	}
> +
> +	for (i = 0; i < 20; i++) {
> +		char *curr = &ptr3[i * page_size];
> +
> +		ASSERT_TRUE(try_read_write_buf(curr));
> +	}
> +
> +	/* Now map incompatible VMAs in the gaps. */
> +	ptr = mmap(&ptr_region[10 * page_size], 40 * page_size,
> +		   PROT_READ | PROT_WRITE | PROT_EXEC,
> +		   MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +	ptr = mmap(&ptr_region[55 * page_size], 25 * page_size,
> +		   PROT_READ | PROT_WRITE | PROT_EXEC,
> +		   MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/*
> +	 * We end up with VMAs like this:
> +	 *
> +	 * 0    10 .. 50   55 .. 80   100
> +	 * [---][xxxx][---][xxxx][---]
> +	 *
> +	 * Where 'x' signifies VMAs that cannot be merged with those adjacent to
> +	 * them.
> +	 */
> +
> +	/* Multiple VMAs adjacent to one another should result in no error. */
> +	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_INSTALL), 0);
> +	for (i = 0; i < 100; i++) {
> +		char *curr = &ptr_region[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_REMOVE), 0);
> +	for (i = 0; i < 100; i++) {
> +		char *curr = &ptr_region[i * page_size];
> +
> +		ASSERT_TRUE(try_read_write_buf(curr));
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr_region, 100 * page_size), 0);
> +}
> +
> +/*
> + * Assert that batched operations performed using process_madvise() work as
> + * expected.
> + */
> +TEST_F(guard_pages, process_madvise)
> +{
> +	const unsigned long page_size = self->page_size;
> +	pid_t pid = getpid();
> +	int pidfd = pidfd_open(pid, 0);
> +	char *ptr_region, *ptr1, *ptr2, *ptr3;
> +	ssize_t count;
> +	struct iovec vec[6];
> +
> +	ASSERT_NE(pidfd, -1);
> +
> +	/* Reserve region to map over. */
> +	ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
> +			  MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_region, MAP_FAILED);
> +
> +	/* 10 pages offset 1 page into reserve region. */
> +	ptr1 = mmap(&ptr_region[page_size], 10 * page_size,
> +		    PROT_READ | PROT_WRITE,
> +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr1, MAP_FAILED);
> +	/* We want guard markers at start/end of each VMA. */
> +	vec[0].iov_base = ptr1;
> +	vec[0].iov_len = page_size;
> +	vec[1].iov_base = &ptr1[9 * page_size];
> +	vec[1].iov_len = page_size;
> +
> +	/* 5 pages offset 50 pages into reserve region. */
> +	ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size,
> +		    PROT_READ | PROT_WRITE,
> +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr2, MAP_FAILED);
> +	vec[2].iov_base = ptr2;
> +	vec[2].iov_len = page_size;
> +	vec[3].iov_base = &ptr2[4 * page_size];
> +	vec[3].iov_len = page_size;
> +
> +	/* 20 pages offset 79 pages into reserve region. */
> +	ptr3 = mmap(&ptr_region[79 * page_size], 20 * page_size,
> +		    PROT_READ | PROT_WRITE,
> +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr3, MAP_FAILED);
> +	vec[4].iov_base = ptr3;
> +	vec[4].iov_len = page_size;
> +	vec[5].iov_base = &ptr3[19 * page_size];
> +	vec[5].iov_len = page_size;
> +
> +	/* Free surrounding VMAs. */
> +	ASSERT_EQ(munmap(ptr_region, page_size), 0);
> +	ASSERT_EQ(munmap(&ptr_region[11 * page_size], 39 * page_size), 0);
> +	ASSERT_EQ(munmap(&ptr_region[55 * page_size], 24 * page_size), 0);
> +	ASSERT_EQ(munmap(&ptr_region[99 * page_size], page_size), 0);
> +
> +	/* Now guard in one step. */
> +	count = process_madvise(pidfd, vec, 6, MADV_GUARD_INSTALL, 0);
> +
> +	/* OK we don't have permission to do this, skip. */
> +	if (count == -1 && errno == EPERM)
> +		ksft_exit_skip("No process_madvise() permissions, try running as root.\n");
> +
> +	/* Returns the number of bytes advised. */
> +	ASSERT_EQ(count, 6 * page_size);
> +
> +	/* Now make sure the guarding was applied. */
> +
> +	ASSERT_FALSE(try_read_write_buf(ptr1));
> +	ASSERT_FALSE(try_read_write_buf(&ptr1[9 * page_size]));
> +
> +	ASSERT_FALSE(try_read_write_buf(ptr2));
> +	ASSERT_FALSE(try_read_write_buf(&ptr2[4 * page_size]));
> +
> +	ASSERT_FALSE(try_read_write_buf(ptr3));
> +	ASSERT_FALSE(try_read_write_buf(&ptr3[19 * page_size]));
> +
> +	/* Now do the same with unguard... */
> +	count = process_madvise(pidfd, vec, 6, MADV_GUARD_REMOVE, 0);
> +
> +	/* ...and everything should now succeed. */
> +
> +	ASSERT_TRUE(try_read_write_buf(ptr1));
> +	ASSERT_TRUE(try_read_write_buf(&ptr1[9 * page_size]));
> +
> +	ASSERT_TRUE(try_read_write_buf(ptr2));
> +	ASSERT_TRUE(try_read_write_buf(&ptr2[4 * page_size]));
> +
> +	ASSERT_TRUE(try_read_write_buf(ptr3));
> +	ASSERT_TRUE(try_read_write_buf(&ptr3[19 * page_size]));
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr1, 10 * page_size), 0);
> +	ASSERT_EQ(munmap(ptr2, 5 * page_size), 0);
> +	ASSERT_EQ(munmap(ptr3, 20 * page_size), 0);
> +	close(pidfd);
> +}
> +
> +/* Assert that unmapping ranges does not leave guard markers behind. */
> +TEST_F(guard_pages, munmap)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr, *ptr_new1, *ptr_new2;
> +
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Guard first and last pages. */
> +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
> +	ASSERT_EQ(madvise(&ptr[9 * page_size], page_size, MADV_GUARD_INSTALL), 0);
> +
> +	/* Assert that they are guarded. */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[9 * page_size]));
> +
> +	/* Unmap them. */
> +	ASSERT_EQ(munmap(ptr, page_size), 0);
> +	ASSERT_EQ(munmap(&ptr[9 * page_size], page_size), 0);
> +
> +	/* Map over them.*/
> +	ptr_new1 = mmap(ptr, page_size, PROT_READ | PROT_WRITE,
> +			MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_new1, MAP_FAILED);
> +	ptr_new2 = mmap(&ptr[9 * page_size], page_size, PROT_READ | PROT_WRITE,
> +			MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_new2, MAP_FAILED);
> +
> +	/* Assert that they are now not guarded. */
> +	ASSERT_TRUE(try_read_write_buf(ptr_new1));
> +	ASSERT_TRUE(try_read_write_buf(ptr_new2));
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Assert that mprotect() operations have no bearing on guard markers. */
> +TEST_F(guard_pages, mprotect)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	int i;
> +
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Guard the middle of the range. */
> +	ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size,
> +			  MADV_GUARD_INSTALL), 0);
> +
> +	/* Assert that it is indeed guarded. */
> +	ASSERT_FALSE(try_read_write_buf(&ptr[5 * page_size]));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[6 * page_size]));
> +
> +	/* Now make these pages read-only. */
> +	ASSERT_EQ(mprotect(&ptr[5 * page_size], 2 * page_size, PROT_READ), 0);
> +
> +	/* Make sure the range is still guarded. */
> +	ASSERT_FALSE(try_read_buf(&ptr[5 * page_size]));
> +	ASSERT_FALSE(try_read_buf(&ptr[6 * page_size]));
> +
> +	/* Make sure we can guard again without issue.*/
> +	ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size,
> +			  MADV_GUARD_INSTALL), 0);
> +
> +	/* Make sure the range is, yet again, still guarded. */
> +	ASSERT_FALSE(try_read_buf(&ptr[5 * page_size]));
> +	ASSERT_FALSE(try_read_buf(&ptr[6 * page_size]));
> +
> +	/* Now unguard the whole range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0);
> +
> +	/* Make sure the whole range is readable. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_TRUE(try_read_buf(curr));
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Split and merge VMAs and make sure guard pages still behave. */
> +TEST_F(guard_pages, split_merge)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr, *ptr_new;
> +	int i;
> +
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Guard the whole range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), 0);
> +
> +	/* Make sure the whole range is guarded. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +
> +	/* Now unmap some pages in the range so we split. */
> +	ASSERT_EQ(munmap(&ptr[2 * page_size], page_size), 0);
> +	ASSERT_EQ(munmap(&ptr[5 * page_size], page_size), 0);
> +	ASSERT_EQ(munmap(&ptr[8 * page_size], page_size), 0);
> +
> +	/* Make sure the remaining ranges are guarded post-split. */
> +	for (i = 0; i < 2; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +	for (i = 2; i < 5; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +	for (i = 6; i < 8; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +	for (i = 9; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +
> +	/* Now map them again - the unmap will have cleared the guards. */
> +	ptr_new = mmap(&ptr[2 * page_size], page_size, PROT_READ | PROT_WRITE,
> +		       MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_new, MAP_FAILED);
> +	ptr_new = mmap(&ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE,
> +		       MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_new, MAP_FAILED);
> +	ptr_new = mmap(&ptr[8 * page_size], page_size, PROT_READ | PROT_WRITE,
> +		       MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_new, MAP_FAILED);
> +
> +	/* Now make sure guard pages are established. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +		bool result = try_read_write_buf(curr);
> +		bool expect_true = i == 2 || i == 5 || i == 8;
> +
> +		ASSERT_TRUE(expect_true ? result : !result);
> +	}
> +
> +	/* Now guard everything again. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), 0);
> +
> +	/* Make sure the whole range is guarded. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +
> +	/* Now split the range into three. */
> +	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0);
> +	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0);
> +
> +	/* Make sure the whole range is guarded for read. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_buf(curr));
> +	}
> +
> +	/* Now reset protection bits so we merge the whole thing. */
> +	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0);
> +	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size,
> +			   PROT_READ | PROT_WRITE), 0);
> +
> +	/* Make sure the whole range is still guarded. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +
> +	/* Split range into 3 again... */
> +	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0);
> +	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0);
> +
> +	/* ...and unguard the whole range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0);
> +
> +	/* Make sure the whole range is remedied for read. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_TRUE(try_read_buf(curr));
> +	}
> +
> +	/* Merge them again. */
> +	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0);
> +	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size,
> +			   PROT_READ | PROT_WRITE), 0);
> +
> +	/* Now ensure the merged range is remedied for read/write. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_TRUE(try_read_write_buf(curr));
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Assert that MADV_DONTNEED does not remove guard markers. */
> +TEST_F(guard_pages, dontneed)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	int i;
> +
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Back the whole range. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		*curr = 'y';
> +	}
> +
> +	/* Guard every other page. */
> +	for (i = 0; i < 10; i += 2) {
> +		char *curr = &ptr[i * page_size];
> +		int res = madvise(curr, page_size, MADV_GUARD_INSTALL);
> +
> +		ASSERT_EQ(res, 0);
> +	}
> +
> +	/* Indicate that we don't need any of the range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_DONTNEED), 0);
> +
> +	/* Check to ensure guard markers are still in place. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +		bool result = try_read_buf(curr);
> +
> +		if (i % 2 == 0) {
> +			ASSERT_FALSE(result);
> +		} else {
> +			ASSERT_TRUE(result);
> +			/* Make sure we really did get reset to zero page. */
> +			ASSERT_EQ(*curr, '\0');
> +		}
> +
> +		/* Now write... */
> +		result = try_write_buf(&ptr[i * page_size]);
> +
> +		/* ...and make sure same result. */
> +		ASSERT_TRUE(i % 2 != 0 ? result : !result);
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Assert that mlock()'ed pages work correctly with guard markers. */
> +TEST_F(guard_pages, mlock)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	int i;
> +
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Populate. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		*curr = 'y';
> +	}
> +
> +	/* Lock. */
> +	ASSERT_EQ(mlock(ptr, 10 * page_size), 0);
> +
> +	/* Now try to guard, should fail with EINVAL. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), -1);
> +	ASSERT_EQ(errno, EINVAL);
> +
> +	/* OK unlock. */
> +	ASSERT_EQ(munlock(ptr, 10 * page_size), 0);
> +
> +	/* Guard first half of range, should now succeed. */
> +	ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_INSTALL), 0);
> +
> +	/* Make sure guard works. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +		bool result = try_read_write_buf(curr);
> +
> +		if (i < 5) {
> +			ASSERT_FALSE(result);
> +		} else {
> +			ASSERT_TRUE(result);
> +			ASSERT_EQ(*curr, 'x');
> +		}
> +	}
> +
> +	/*
> +	 * Now lock the latter part of the range. We can't lock the guard pages,
> +	 * as this would result in the pages being populated and the guarding
> +	 * would cause this to error out.
> +	 */
> +	ASSERT_EQ(mlock(&ptr[5 * page_size], 5 * page_size), 0);
> +
> +	/*
> +	 * Now remove guard pages, we permit mlock()'d ranges to have guard
> +	 * pages removed as it is a non-destructive operation.
> +	 */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0);
> +
> +	/* Now check that no guard pages remain. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_TRUE(try_read_write_buf(curr));
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/*
> + * Assert that moving, extending and shrinking memory via mremap() retains
> + * guard markers where possible.
> + *
> + * - Moving a mapping alone should retain markers as they are.
> + */
> +TEST_F(guard_pages, mremap_move)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr, *ptr_new;
> +
> +	/* Map 5 pages. */
> +	ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Place guard markers at both ends of the 5 page span. */
> +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
> +	ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_INSTALL), 0);
> +
> +	/* Make sure the guard pages are in effect. */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> +
> +	/* Map a new region we will move this range into. Doing this ensures
> +	 * that we have reserved a range to map into.
> +	 */
> +	ptr_new = mmap(NULL, 5 * page_size, PROT_NONE, MAP_ANON | MAP_PRIVATE,
> +		       -1, 0);
> +	ASSERT_NE(ptr_new, MAP_FAILED);
> +
> +	ASSERT_EQ(mremap(ptr, 5 * page_size, 5 * page_size,
> +			 MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new), ptr_new);
> +
> +	/* Make sure the guard markers are retained. */
> +	ASSERT_FALSE(try_read_write_buf(ptr_new));
> +	ASSERT_FALSE(try_read_write_buf(&ptr_new[4 * page_size]));
> +
> +	/*
> +	 * Clean up - we only need reference the new pointer as we overwrote the
> +	 * PROT_NONE range and moved the existing one.
> +	 */
> +	munmap(ptr_new, 5 * page_size);
> +}
> +
> +/*
> + * Assert that moving, extending and shrinking memory via mremap() retains
> + * guard markers where possible.
> + *
> + * Expanding should retain guard pages, only now in different position. The user
> + * will have to remove guard pages manually to fix up (they'd have to do the
> + * same if it were a PROT_NONE mapping).
> + */
> +TEST_F(guard_pages, mremap_expand)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr, *ptr_new;
> +
> +	/* Map 10 pages... */
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +	/* ...But unmap the last 5 so we can ensure we can expand into them. */
> +	ASSERT_EQ(munmap(&ptr[5 * page_size], 5 * page_size), 0);
> +
> +	/* Place guard markers at both ends of the 5 page span. */
> +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
> +	ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_INSTALL), 0);
> +
> +	/* Make sure the guarding is in effect. */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> +
> +	/* Now expand to 10 pages. */
> +	ptr = mremap(ptr, 5 * page_size, 10 * page_size, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/*
> +	 * Make sure the guard markers are retained in their original positions.
> +	 */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> +
> +	/* Reserve a region which we can move to and expand into. */
> +	ptr_new = mmap(NULL, 20 * page_size, PROT_NONE,
> +		       MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_new, MAP_FAILED);
> +
> +	/* Now move and expand into it. */
> +	ptr = mremap(ptr, 10 * page_size, 20 * page_size,
> +		     MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new);
> +	ASSERT_EQ(ptr, ptr_new);
> +
> +	/*
> +	 * Again, make sure the guard markers are retained in their original positions.
> +	 */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> +
> +	/*
> +	 * A real user would have to remove guard markers, but would reasonably
> +	 * expect all characteristics of the mapping to be retained, including
> +	 * guard markers.
> +	 */
> +
> +	/* Cleanup. */
> +	munmap(ptr, 20 * page_size);
> +}
> +/*
> + * Assert that moving, extending and shrinking memory via mremap() retains
> + * guard markers where possible.
> + *
> + * Shrinking will result in markers that are shrunk over being removed. Again,
> + * if the user were using a PROT_NONE mapping they'd have to manually fix this
> + * up also so this is OK.
> + */
> +TEST_F(guard_pages, mremap_shrink)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	int i;
> +
> +	/* Map 5 pages. */
> +	ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Place guard markers at both ends of the 5 page span. */
> +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
> +	ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_INSTALL), 0);
> +
> +	/* Make sure the guarding is in effect. */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> +
> +	/* Now shrink to 3 pages. */
> +	ptr = mremap(ptr, 5 * page_size, 3 * page_size, MREMAP_MAYMOVE);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* We expect the guard marker at the start to be retained... */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +
> +	/* ...But remaining pages will not have guard markers. */
> +	for (i = 1; i < 3; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_TRUE(try_read_write_buf(curr));
> +	}
> +
> +	/*
> +	 * As with expansion, a real user would have to remove guard pages and
> +	 * fixup. But you'd have to do similar manual things with PROT_NONE
> +	 * mappings too.
> +	 */
> +
> +	/*
> +	 * If we expand back to the original size, the end marker will, of
> +	 * course, no longer be present.
> +	 */
> +	ptr = mremap(ptr, 3 * page_size, 5 * page_size, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Again, we expect the guard marker at the start to be retained... */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +
> +	/* ...But remaining pages will not have guard markers. */
> +	for (i = 1; i < 5; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_TRUE(try_read_write_buf(curr));
> +	}
> +
> +	/* Cleanup. */
> +	munmap(ptr, 5 * page_size);
> +}
> +
> +/*
> + * Assert that forking a process with VMAs that do not have VM_WIPEONFORK set
> + * retain guard pages.
> + */
> +TEST_F(guard_pages, fork)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	pid_t pid;
> +	int i;
> +
> +	/* Map 10 pages. */
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Establish guard pages in the first 5 pages. */
> +	ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_INSTALL), 0);
> +
> +	pid = fork();
> +	ASSERT_NE(pid, -1);
> +	if (!pid) {
> +		/* This is the child process now. */
> +
> +		/* Assert that the guarding is in effect. */
> +		for (i = 0; i < 10; i++) {
> +			char *curr = &ptr[i * page_size];
> +			bool result = try_read_write_buf(curr);
> +
> +			ASSERT_TRUE(i >= 5 ? result : !result);
> +		}
> +
> +		/* Now unguard the range.*/
> +		ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0);
> +
> +		exit(0);
> +	}
> +
> +	/* Parent process. */
> +
> +	/* Parent simply waits on child. */
> +	waitpid(pid, NULL, 0);
> +
> +	/* Child unguard does not impact parent page table state. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +		bool result = try_read_write_buf(curr);
> +
> +		ASSERT_TRUE(i >= 5 ? result : !result);
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/*
> + * Assert that forking a process with VMAs that do have VM_WIPEONFORK set
> + * behave as expected.
> + */
> +TEST_F(guard_pages, fork_wipeonfork)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	pid_t pid;
> +	int i;
> +
> +	/* Map 10 pages. */
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Mark wipe on fork. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_WIPEONFORK), 0);
> +
> +	/* Guard the first 5 pages. */
> +	ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_INSTALL), 0);
> +
> +	pid = fork();
> +	ASSERT_NE(pid, -1);
> +	if (!pid) {
> +		/* This is the child process now. */
> +
> +		/* Guard will have been wiped. */
> +		for (i = 0; i < 10; i++) {
> +			char *curr = &ptr[i * page_size];
> +
> +			ASSERT_TRUE(try_read_write_buf(curr));
> +		}
> +
> +		exit(0);
> +	}
> +
> +	/* Parent process. */
> +
> +	waitpid(pid, NULL, 0);
> +
> +	/* Guard markers should be in effect.*/
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +		bool result = try_read_write_buf(curr);
> +
> +		ASSERT_TRUE(i >= 5 ? result : !result);
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Ensure that MADV_FREE retains guard entries as expected. */
> +TEST_F(guard_pages, lazyfree)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	int i;
> +
> +	/* Map 10 pages. */
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Guard range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), 0);
> +
> +	/* Ensure guarded. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +
> +	/* Lazyfree range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_FREE), 0);
> +
> +	/* This should leave the guard markers in place. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Ensure that MADV_POPULATE_READ, MADV_POPULATE_WRITE behave as expected. */
> +TEST_F(guard_pages, populate)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +
> +	/* Map 10 pages. */
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Guard range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), 0);
> +
> +	/* Populate read should error out... */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_READ), -1);
> +	ASSERT_EQ(errno, EFAULT);
> +
> +	/* ...as should populate write. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_WRITE), -1);
> +	ASSERT_EQ(errno, EFAULT);
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Ensure that MADV_COLD, MADV_PAGEOUT do not remove guard markers. */
> +TEST_F(guard_pages, cold_pageout)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	int i;
> +
> +	/* Map 10 pages. */
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Guard range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), 0);
> +
> +	/* Ensure guarded. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +
> +	/* Now mark cold. This should have no impact on guard markers. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_COLD), 0);
> +
> +	/* Should remain guarded. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +
> +	/* OK, now page out. This should equally have no effect on markers. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_PAGEOUT), 0);
> +
> +	/* Should remain guarded. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Ensure that guard pages do not break userfaultfd. */
> +TEST_F(guard_pages, uffd)
> +{
> +	const unsigned long page_size = self->page_size;
> +	int uffd;
> +	char *ptr;
> +	int i;
> +	struct uffdio_api api = {
> +		.api = UFFD_API,
> +		.features = 0,
> +	};
> +	struct uffdio_register reg;
> +	struct uffdio_range range;
> +
> +	/* Set up uffd. */
> +	uffd = userfaultfd(0);
> +	if (uffd == -1 && errno == EPERM)
> +		ksft_exit_skip("No userfaultfd permissions, try running as root.\n");
> +	ASSERT_NE(uffd, -1);
> +
> +	ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
> +
> +	/* Map 10 pages. */
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Register the range with uffd. */
> +	range.start = (unsigned long)ptr;
> +	range.len = 10 * page_size;
> +	reg.range = range;
> +	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
> +	ASSERT_EQ(ioctl(uffd, UFFDIO_REGISTER, &reg), 0);
> +
> +	/* Guard the range. This should not trigger the uffd. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), 0);
> +
> +	/* The guarding should behave as usual with no uffd intervention. */
> +	for (i = 0; i < 10; i++) {
> +		char *curr = &ptr[i * page_size];
> +
> +		ASSERT_FALSE(try_read_write_buf(curr));
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(ioctl(uffd, UFFDIO_UNREGISTER, &range), 0);
> +	close(uffd);
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +TEST_HARNESS_MAIN

Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>

BR, Jarkko

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 2/5] mm: add PTE_MARKER_GUARD PTE marker
  2024-10-23 16:24 ` [PATCH v3 2/5] mm: add PTE_MARKER_GUARD PTE marker Lorenzo Stoakes
@ 2024-10-28 20:34   ` Jarkko Sakkinen
  0 siblings, 0 replies; 29+ messages in thread
From: Jarkko Sakkinen @ 2024-10-28 20:34 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Wed Oct 23, 2024 at 7:24 PM EEST, Lorenzo Stoakes wrote:
> Add a new PTE marker that results in any access causing the accessing
> process to segfault.
>
> This is preferable to PTE_MARKER_POISONED, which results in the same
> handling as hardware poisoned memory, and is thus undesirable for cases
> where we simply wish to 'soft' poison a range.
>
> This is in preparation for implementing the ability to specify guard pages
> at the page table level, i.e. ranges that, when accessed, should cause
> process termination.
>
> Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the
> function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single
> purpose was simply incorrect.
>
> We then reuse the same logic to determine whether a zap should clear a
> guard entry - this should only be performed on teardown and never on
> MADV_DONTNEED or MADV_FREE.
>
> We additionally add a WARN_ON_ONCE() in hugetlb logic should a guard marker
> be encountered there, as we explicitly do not support this operation and
> this should not occur.
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  include/linux/mm_inline.h |  2 +-
>  include/linux/swapops.h   | 24 +++++++++++++++++++++++-
>  mm/hugetlb.c              |  4 ++++
>  mm/memory.c               | 18 +++++++++++++++---
>  mm/mprotect.c             |  6 ++++--
>  5 files changed, 47 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 355cf46a01a6..1b6a917fffa4 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -544,7 +544,7 @@ static inline pte_marker copy_pte_marker(
>  {
>  	pte_marker srcm = pte_marker_get(entry);
>  	/* Always copy error entries. */
> -	pte_marker dstm = srcm & PTE_MARKER_POISONED;
> +	pte_marker dstm = srcm & (PTE_MARKER_POISONED | PTE_MARKER_GUARD);
>  
>  	/* Only copy PTE markers if UFFD register matches. */
>  	if ((srcm & PTE_MARKER_UFFD_WP) && userfaultfd_wp(dst_vma))
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index cb468e418ea1..96f26e29fefe 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -426,9 +426,19 @@ typedef unsigned long pte_marker;
>   * "Poisoned" here is meant in the very general sense of "future accesses are
>   * invalid", instead of referring very specifically to hardware memory errors.
>   * This marker is meant to represent any of various different causes of this.
> + *
> + * Note that, when encountered by the faulting logic, PTEs with this marker will
> + * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error
> + * logic.
>   */
>  #define  PTE_MARKER_POISONED			BIT(1)
> -#define  PTE_MARKER_MASK			(BIT(2) - 1)
> +/*
> + * Indicates that, on fault, this PTE will cause a SIGSEGV signal to be
> + * sent. This means guard markers behave in effect as if the region were mapped
> + * PROT_NONE, rather than if they were a memory hole or equivalent.
> + */
> +#define  PTE_MARKER_GUARD			BIT(2)
> +#define  PTE_MARKER_MASK			(BIT(3) - 1)
>  
>  static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
>  {
> @@ -464,6 +474,18 @@ static inline int is_poisoned_swp_entry(swp_entry_t entry)
>  {
>  	return is_pte_marker_entry(entry) &&
>  	    (pte_marker_get(entry) & PTE_MARKER_POISONED);
> +
> +}
> +
> +static inline swp_entry_t make_guard_swp_entry(void)
> +{
> +	return make_pte_marker_entry(PTE_MARKER_GUARD);
> +}
> +
> +static inline int is_guard_swp_entry(swp_entry_t entry)
> +{
> +	return is_pte_marker_entry(entry) &&
> +		(pte_marker_get(entry) & PTE_MARKER_GUARD);
>  }
>  
>  /*
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 906294ac85dc..2c8c5da0f5d3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6353,6 +6353,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  				ret = VM_FAULT_HWPOISON_LARGE |
>  				      VM_FAULT_SET_HINDEX(hstate_index(h));
>  				goto out_mutex;
> +			} else if (WARN_ON_ONCE(marker & PTE_MARKER_GUARD)) {
> +				/* This isn't supported in hugetlb. */
> +				ret = VM_FAULT_SIGSEGV;
> +				goto out_mutex;
>  			}
>  		}
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f614523b9f4..551455cd453f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1455,7 +1455,7 @@ static inline bool should_zap_folio(struct zap_details *details,
>  	return !folio_test_anon(folio);
>  }
>  
> -static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
> +static inline bool zap_drop_markers(struct zap_details *details)
>  {
>  	if (!details)
>  		return false;
> @@ -1476,7 +1476,7 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>  	if (vma_is_anonymous(vma))
>  		return;
>  
> -	if (zap_drop_file_uffd_wp(details))
> +	if (zap_drop_markers(details))
>  		return;
>  
>  	for (;;) {
> @@ -1671,7 +1671,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			 * drop the marker if explicitly requested.
>  			 */
>  			if (!vma_is_anonymous(vma) &&
> -			    !zap_drop_file_uffd_wp(details))
> +			    !zap_drop_markers(details))
> +				continue;
> +		} else if (is_guard_swp_entry(entry)) {
> +			/*
> +			 * Ordinary zapping should not remove guard PTE
> +			 * markers. Only do so if we should remove PTE markers
> +			 * in general.
> +			 */
> +			if (!zap_drop_markers(details))
>  				continue;
>  		} else if (is_hwpoison_entry(entry) ||
>  			   is_poisoned_swp_entry(entry)) {
> @@ -4003,6 +4011,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>  	if (marker & PTE_MARKER_POISONED)
>  		return VM_FAULT_HWPOISON;
>  
> +	/* Hitting a guard page is always a fatal condition. */
> +	if (marker & PTE_MARKER_GUARD)
> +		return VM_FAULT_SIGSEGV;
> +
>  	if (pte_marker_entry_uffd_wp(entry))
>  		return pte_marker_handle_uffd_wp(vmf);
>  
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 0c5d6d06107d..1f671b0667bd 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -236,9 +236,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			} else if (is_pte_marker_entry(entry)) {
>  				/*
>  				 * Ignore error swap entries unconditionally,
> -				 * because any access should sigbus anyway.
> +				 * because any access should sigbus/sigsegv
> +				 * anyway.
>  				 */
> -				if (is_poisoned_swp_entry(entry))
> +				if (is_poisoned_swp_entry(entry) ||
> +				    is_guard_swp_entry(entry))
>  					continue;
>  				/*
>  				 * If this is uffd-wp pte marker and we'd like

Acked-by: Jarkko Sakkinen <jarkko@kernel.org>

BR, Jarkko

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism
  2024-10-23 16:24 ` [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism Lorenzo Stoakes
                     ` (2 preceding siblings ...)
  2024-10-25 21:44   ` Vlastimil Babka
@ 2024-10-28 20:45   ` Jarkko Sakkinen
  3 siblings, 0 replies; 29+ messages in thread
From: Jarkko Sakkinen @ 2024-10-28 20:45 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Suren Baghdasaryan, Liam R . Howlett, Matthew Wilcox,
	Vlastimil Babka, Paul E . McKenney, Jann Horn, David Hildenbrand,
	linux-mm, linux-kernel, Muchun Song, Richard Henderson,
	Matt Turner, Thomas Bogendoerfer, James E . J . Bottomley,
	Helge Deller, Chris Zankel, Max Filippov, Arnd Bergmann,
	linux-alpha, linux-mips, linux-parisc, linux-arch, Shuah Khan,
	Christian Brauner, linux-kselftest, Sidhartha Kumar, Jeff Xu,
	Christoph Hellwig, linux-api, John Hubbard

On Wed Oct 23, 2024 at 7:24 PM EEST, Lorenzo Stoakes wrote:
> Implement a new lightweight guard page feature, that is, regions of userland
> virtual memory that, when accessed, cause a fatal signal to arise.
>
> Currently users must establish PROT_NONE ranges to achieve this.

A bit off-topic, but another hack with PROT_NONE is to allocate naturally
aligned ranges (rough sketch below):

1. mmap() of 2*N size with PROT_NONE.
2. mmap(MAP_FIXED) of N size from that range.

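Something like this (a minimal userspace sketch, assuming N is a
power-of-two multiple of the page size; error handling and unmapping of
the unused slack are omitted, and the helper name is made up):

	#include <stdint.h>
	#include <sys/mman.h>

	/* Hypothetical helper: return N bytes of mapping aligned to N. */
	static void *mmap_aligned(size_t n)
	{
		/* 1. Over-reserve 2*N with PROT_NONE - nothing is committed. */
		char *reserve = mmap(NULL, 2 * n, PROT_NONE,
				     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (reserve == MAP_FAILED)
			return NULL;

		/*
		 * 2. Place the real N-byte mapping at the first N-aligned
		 *    address inside the reservation.
		 */
		char *aligned = (char *)(((uintptr_t)reserve + (n - 1)) & ~(uintptr_t)(n - 1));
		return mmap(aligned, n, PROT_READ | PROT_WRITE,
			    MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	}
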

>
> However this is very costly memory-wise - we need a VMA for each and every
> one of these regions AND they become unmergeable with surrounding VMAs.
>
> In addition, repeated mmap() calls require repeated kernel context switches
> and contention on the mmap lock to install these ranges, potentially also
> having to unmap memory if installed over existing ranges.
>
> The lightweight guard approach eliminates the VMA cost altogether - rather
> than establishing a PROT_NONE VMA, it operates at the level of page table
> entries - establishing PTE markers such that accesses to them cause a fault
> followed by a SIGSEGV signal being raised.
>
> This is achieved through the PTE marker mechanism, which we have already
> extended to provide PTE_MARKER_GUARD, which we install via the generic
> page walking logic which we have extended for this purpose.
>
> These guard ranges are established with MADV_GUARD_INSTALL. If the range in
> which they are installed contains any existing mappings, they will be
> zapped, i.e. freed and unmapped (thus mimicking the behaviour
> of MADV_DONTNEED in this respect).
>
> Any existing guard entries will be left untouched. There is therefore no
> nesting of guarded pages.
>
> Guarded ranges are NOT cleared by MADV_DONTNEED nor MADV_FREE (in both
> instances the memory range may be reused at which point a user would expect
> guards to still be in place), but they are cleared via MADV_GUARD_REMOVE,
> process teardown or unmapping of memory ranges.
>
> The guard property can be removed from ranges via MADV_GUARD_REMOVE. The
> ranges over which this is applied, should they contain non-guard entries,
> will be untouched, with only guard entries being cleared.
>
> We permit this operation on anonymous memory only, and only VMAs which are
> non-special, non-huge and not mlock()'d (if we permitted this we'd have to
> drop locked pages which would be rather counterintuitive).
>
> Racing page faults can cause attempts to install guard pages to be
> interrupted, resulting in a zap, and this process can end up being
> repeated. If this happens more often than would be expected in normal operation,
> we rescind locks and retry the whole thing, which avoids lock contention in
> this scenario.
>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Suggested-by: Jann Horn <jannh@google.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  arch/alpha/include/uapi/asm/mman.h     |   3 +
>  arch/mips/include/uapi/asm/mman.h      |   3 +
>  arch/parisc/include/uapi/asm/mman.h    |   3 +
>  arch/xtensa/include/uapi/asm/mman.h    |   3 +
>  include/uapi/asm-generic/mman-common.h |   3 +
>  mm/internal.h                          |   6 +
>  mm/madvise.c                           | 225 +++++++++++++++++++++++++
>  mm/mseal.c                             |   1 +
>  8 files changed, 247 insertions(+)
>
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 763929e814e9..1e700468a685 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -78,6 +78,9 @@
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>  
> +#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
> +#define MADV_GUARD_REMOVE 103		/* unguard range */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index 9c48d9a21aa0..b700dae28c48 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -105,6 +105,9 @@
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>  
> +#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
> +#define MADV_GUARD_REMOVE 103		/* unguard range */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 68c44f99bc93..b6a709506987 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -75,6 +75,9 @@
>  #define MADV_HWPOISON     100		/* poison a page for testing */
>  #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
>  
> +#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
> +#define MADV_GUARD_REMOVE 103		/* unguard range */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 1ff0c858544f..99d4ccee7f6e 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -113,6 +113,9 @@
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>  
> +#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
> +#define MADV_GUARD_REMOVE 103		/* unguard range */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..1ea2c4c33b86 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -79,6 +79,9 @@
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>  
> +#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
> +#define MADV_GUARD_REMOVE 103		/* unguard range */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index fb1fb0c984e4..fcf08b5e64dc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -423,6 +423,12 @@ extern unsigned long highest_memmap_pfn;
>   */
>  #define MAX_RECLAIM_RETRIES 16
>  
> +/*
> + * Maximum number of attempts we make to install guard pages before we give up
> + * and return -ERESTARTNOINTR to have userspace try again.
> + */
> +#define MAX_MADVISE_GUARD_RETRIES 3
> +
>  /*
>   * in mm/vmscan.c:
>   */
> diff --git a/mm/madvise.c b/mm/madvise.c
> index e871a72a6c32..48eba25e25fe 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -60,6 +60,8 @@ static int madvise_need_mmap_write(int behavior)
>  	case MADV_POPULATE_READ:
>  	case MADV_POPULATE_WRITE:
>  	case MADV_COLLAPSE:
> +	case MADV_GUARD_INSTALL:
> +	case MADV_GUARD_REMOVE:
>  		return 0;
>  	default:
>  		/* be safe, default to 1. list exceptions explicitly */
> @@ -1017,6 +1019,214 @@ static long madvise_remove(struct vm_area_struct *vma,
>  	return error;
>  }
>  
> +static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
> +{
> +	vm_flags_t disallowed = VM_SPECIAL | VM_HUGETLB;
> +
> +	/*
> +	 * A user could lock after setting a guard range but that's fine, as
> +	 * they'd not be able to fault in. The issue arises when we try to zap
> +	 * existing locked VMAs. We don't want to do that.
> +	 */
> +	if (!allow_locked)
> +		disallowed |= VM_LOCKED;
> +
> +	if (!vma_is_anonymous(vma))
> +		return false;
> +
> +	if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
> +		return false;
> +
> +	return true;
> +}
> +
> +static bool is_guard_pte_marker(pte_t ptent)
> +{
> +	return is_pte_marker(ptent) &&
> +		is_guard_swp_entry(pte_to_swp_entry(ptent));
> +}
> +
> +static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
> +				   unsigned long next, struct mm_walk *walk)
> +{
> +	pud_t pudval = pudp_get(pud);
> +
> +	/* If huge return >0 so we abort the operation + zap. */
> +	return pud_trans_huge(pudval) || pud_devmap(pudval);
> +}
> +
> +static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr,
> +				   unsigned long next, struct mm_walk *walk)
> +{
> +	pmd_t pmdval = pmdp_get(pmd);
> +
> +	/* If huge return >0 so we abort the operation + zap. */
> +	return pmd_trans_huge(pmdval) || pmd_devmap(pmdval);
> +}
> +
> +static int guard_install_pte_entry(pte_t *pte, unsigned long addr,
> +				   unsigned long next, struct mm_walk *walk)
> +{
> +	pte_t pteval = ptep_get(pte);
> +	unsigned long *nr_pages = (unsigned long *)walk->private;
> +
> +	/* If there is already a guard page marker, we have nothing to do. */
> +	if (is_guard_pte_marker(pteval)) {
> +		(*nr_pages)++;
> +
> +		return 0;
> +	}
> +
> +	/* If populated return >0 so we abort the operation + zap. */
> +	return 1;
> +}
> +
> +static int guard_install_set_pte(unsigned long addr, unsigned long next,
> +				 pte_t *ptep, struct mm_walk *walk)
> +{
> +	unsigned long *nr_pages = (unsigned long *)walk->private;
> +
> +	/* Simply install a PTE marker, this causes segfault on access. */
> +	*ptep = make_pte_marker(PTE_MARKER_GUARD);
> +	(*nr_pages)++;
> +
> +	return 0;
> +}
> +
> +static const struct mm_walk_ops guard_install_walk_ops = {
> +	.pud_entry		= guard_install_pud_entry,
> +	.pmd_entry		= guard_install_pmd_entry,
> +	.pte_entry		= guard_install_pte_entry,
> +	.install_pte		= guard_install_set_pte,
> +	.walk_lock		= PGWALK_RDLOCK,
> +};
> +
> +static long madvise_guard_install(struct vm_area_struct *vma,
> +				 struct vm_area_struct **prev,
> +				 unsigned long start, unsigned long end)
> +{
> +	long err;
> +	int i;
> +
> +	*prev = vma;
> +	if (!is_valid_guard_vma(vma, /* allow_locked = */false))
> +		return -EINVAL;
> +
> +	/*
> +	 * If we install guard markers, then the range is no longer
> +	 * empty from a page table perspective and therefore it's
> +	 * appropriate to have an anon_vma.
> +	 *
> +	 * This ensures that on fork, we copy page tables correctly.
> +	 */
> +	err = anon_vma_prepare(vma);
> +	if (err)
> +		return err;
> +
> +	/*
> +	 * Optimistically try to install the guard marker pages first. If any
> +	 * non-guard pages are encountered, give up and zap the range before
> +	 * trying again.
> +	 *
> +	 * We try a few times before giving up and releasing back to userland to
> +	 * loop around, releasing locks in the process to avoid contention. This
> +	 * would only happen if there was a great many racing page faults.
> +	 *
> +	 * In most cases we should simply install the guard markers immediately
> +	 * with no zap or looping.
> +	 */
> +	for (i = 0; i < MAX_MADVISE_GUARD_RETRIES; i++) {
> +		unsigned long nr_pages = 0;
> +
> +		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
> +		err = walk_page_range_mm(vma->vm_mm, start, end,
> +					 &guard_install_walk_ops, &nr_pages);
> +		if (err < 0)
> +			return err;
> +
> +		if (err == 0) {
> +			unsigned long nr_expected_pages = PHYS_PFN(end - start);
> +
> +			VM_WARN_ON(nr_pages != nr_expected_pages);
> +			return 0;
> +		}
> +
> +		/*
> +		 * OK some of the range have non-guard pages mapped, zap
> +		 * them. This leaves existing guard pages in place.
> +		 */
> +		zap_page_range_single(vma, start, end - start, NULL);
> +	}
> +
> +	/*
> +	 * We were unable to install the guard pages due to being raced by page
> +	 * faults. This should not happen ordinarily. We return to userspace and
> +	 * immediately retry, relieving lock contention.
> +	 */
> +	return -ERESTARTNOINTR;
> +}
> +
> +static int guard_remove_pud_entry(pud_t *pud, unsigned long addr,
> +				  unsigned long next, struct mm_walk *walk)
> +{
> +	pud_t pudval = pudp_get(pud);
> +
> +	/* If huge, cannot have guard pages present, so no-op - skip. */
> +	if (pud_trans_huge(pudval) || pud_devmap(pudval))
> +		walk->action = ACTION_CONTINUE;
> +
> +	return 0;
> +}
> +
> +static int guard_remove_pmd_entry(pmd_t *pmd, unsigned long addr,
> +				  unsigned long next, struct mm_walk *walk)
> +{
> +	pmd_t pmdval = pmdp_get(pmd);
> +
> +	/* If huge, cannot have guard pages present, so no-op - skip. */
> +	if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
> +		walk->action = ACTION_CONTINUE;
> +
> +	return 0;
> +}
> +
> +static int guard_remove_pte_entry(pte_t *pte, unsigned long addr,
> +				  unsigned long next, struct mm_walk *walk)
> +{
> +	pte_t ptent = ptep_get(pte);
> +
> +	if (is_guard_pte_marker(ptent)) {
> +		/* Simply clear the PTE marker. */
> +		pte_clear_not_present_full(walk->mm, addr, pte, false);
> +		update_mmu_cache(walk->vma, addr, pte);
> +	}
> +
> +	return 0;
> +}
> +
> +static const struct mm_walk_ops guard_remove_walk_ops = {
> +	.pud_entry		= guard_remove_pud_entry,
> +	.pmd_entry		= guard_remove_pmd_entry,
> +	.pte_entry		= guard_remove_pte_entry,
> +	.walk_lock		= PGWALK_RDLOCK,
> +};
> +
> +static long madvise_guard_remove(struct vm_area_struct *vma,
> +				 struct vm_area_struct **prev,
> +				 unsigned long start, unsigned long end)
> +{
> +	*prev = vma;
> +	/*
> +	 * We're ok with removing guards in mlock()'d ranges, as this is a
> +	 * non-destructive action.
> +	 */
> +	if (!is_valid_guard_vma(vma, /* allow_locked = */true))
> +		return -EINVAL;
> +
> +	return walk_page_range(vma->vm_mm, start, end,
> +			       &guard_remove_walk_ops, NULL);
> +}
> +
>  /*
>   * Apply an madvise behavior to a region of a vma.  madvise_update_vma
>   * will handle splitting a vm area into separate areas, each area with its own
> @@ -1098,6 +1308,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>  		break;
>  	case MADV_COLLAPSE:
>  		return madvise_collapse(vma, prev, start, end);
> +	case MADV_GUARD_INSTALL:
> +		return madvise_guard_install(vma, prev, start, end);
> +	case MADV_GUARD_REMOVE:
> +		return madvise_guard_remove(vma, prev, start, end);
>  	}
>  
>  	anon_name = anon_vma_name(vma);
> @@ -1197,6 +1411,8 @@ madvise_behavior_valid(int behavior)
>  	case MADV_DODUMP:
>  	case MADV_WIPEONFORK:
>  	case MADV_KEEPONFORK:
> +	case MADV_GUARD_INSTALL:
> +	case MADV_GUARD_REMOVE:
>  #ifdef CONFIG_MEMORY_FAILURE
>  	case MADV_SOFT_OFFLINE:
>  	case MADV_HWPOISON:
> @@ -1490,6 +1706,15 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  	while (iov_iter_count(iter)) {
>  		ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
>  				 iter_iov_len(iter), behavior);
> +		/*
> +		 * We cannot return this, as we instead return the number of
> +		 * successful operations. Since all this would achieve in a
> +		 * single madvise() invocation is to re-enter the syscall, and
> +		 * we have already rescinded locks, it should be no problem to
> +		 * simply try again.
> +		 */
> +		if (ret == -ERESTARTNOINTR)
> +			continue;
>  		if (ret < 0)
>  			break;
>  		iov_iter_advance(iter, iter_iov_len(iter));
> diff --git a/mm/mseal.c b/mm/mseal.c
> index ece977bd21e1..81d6e980e8a9 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -30,6 +30,7 @@ static bool is_madv_discard(int behavior)
>  	case MADV_REMOVE:
>  	case MADV_DONTFORK:
>  	case MADV_WIPEONFORK:
> +	case MADV_GUARD_INSTALL:
>  		return true;
>  	}
>  

Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>

BR, Jarkko

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs
  2024-10-28 20:29   ` Jarkko Sakkinen
@ 2024-10-28 21:49     ` Lorenzo Stoakes
  0 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2024-10-28 21:49 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Andrew Morton, Suren Baghdasaryan, Liam R . Howlett,
	Matthew Wilcox, Vlastimil Babka, Paul E . McKenney, Jann Horn,
	David Hildenbrand, linux-mm, linux-kernel, Muchun Song,
	Richard Henderson, Matt Turner, Thomas Bogendoerfer,
	James E . J . Bottomley, Helge Deller, Chris Zankel, Max Filippov,
	Arnd Bergmann, linux-alpha, linux-mips, linux-parisc, linux-arch,
	Shuah Khan, Christian Brauner, linux-kselftest, Sidhartha Kumar,
	Jeff Xu, Christoph Hellwig, linux-api, John Hubbard

Note there's a v4 :)

On Mon, Oct 28, 2024 at 10:29:44PM +0200, Jarkko Sakkinen wrote:
> On Wed Oct 23, 2024 at 7:24 PM EEST, Lorenzo Stoakes wrote:
> > The existing generic pagewalk logic permits the walking of page tables,
> > invoking callbacks at individual page table levels via user-provided
> > mm_walk_ops callbacks.
> >
> > This is useful for traversing existing page table entries, but precludes
> > the ability to establish new ones.
> >
> > Existing mechanism for performing a walk which also installs page table
> > entries if necessary are heavily duplicated throughout the kernel, each
> > with semantic differences from one another and largely unavailable for use
> > elsewhere.
> >
> > Rather than add yet another implementation, we extend the generic pagewalk
> > logic to enable the installation of page table entries by adding a new
> > install_pte() callback in mm_walk_ops. If this is specified, then upon
> > encountering a missing page table entry, we allocate and install a new one
> > and continue the traversal.
> >
> > If a THP huge page is encountered at either the PMD or PUD level we split
> > it only if there is an ops->pte_entry() (or ops->pmd_entry at PUD level),
> > otherwise, if there is only an ops->install_pte(), we avoid the unnecessary
> > split.
>
> Just for interest: does this mean that the operation is always
> "destructive" (i.e. modifying state) even when install_pte() does not
> do anything, i.e. does the split always happen despite what the callback
> does? Not really an expert on this, but this paragraph won't leave me
> alone :-)

Well no, as I say (perhaps not clearly), if something already exists we don't
even split huge pages. We only do so if you explicitly ask to examine page
table levels below those where a huge page may exist.

The guard page code goes to great lengths to avoid this in all cases and
doesn't split at all.

>
> >
> > We do not support hugetlb at this stage.
> >
> > If this function returns an error, or an allocation fails during the
> > operation, we abort the operation altogether. It is up to the caller to
> > deal appropriately with partially populated page table ranges.
> >
> > If install_pte() is defined, the semantics of pte_entry() change - this
> > callback is then only invoked if the entry already exists. This is a useful
> > property, as it allows a caller to handle existing PTEs while installing
> > new ones where necessary in the specified range.
> >
> > If install_pte() is not defined, then there is no functional difference to
> > this patch, so all existing logic will work precisely as it did before.
> >
> > As we only permit the installation of PTEs where a mapping does not already
> > exist there is no need for TLB management, however we do invoke
> > update_mmu_cache() for architectures which require manual maintenance of
> > mappings for other CPUs.
> >
> > We explicitly do not allow the existing page walk API to expose this
> > feature as it is dangerous and intended for internal mm use only. Therefore
> > we provide a new walk_page_range_mm() function exposed only to
> > mm/internal.h.
> >
> > Reviewed-by: Jann Horn <jannh@google.com>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  include/linux/pagewalk.h |  18 +++-
> >  mm/internal.h            |   6 ++
> >  mm/pagewalk.c            | 227 +++++++++++++++++++++++++++------------
> >  3 files changed, 182 insertions(+), 69 deletions(-)
> >
> > diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> > index f5eb5a32aeed..9700a29f8afb 100644
> > --- a/include/linux/pagewalk.h
> > +++ b/include/linux/pagewalk.h
> > @@ -25,12 +25,15 @@ enum page_walk_lock {
> >   *			this handler is required to be able to handle
> >   *			pmd_trans_huge() pmds.  They may simply choose to
> >   *			split_huge_page() instead of handling it explicitly.
> > - * @pte_entry:		if set, called for each PTE (lowest-level) entry,
> > - *			including empty ones
> > + * @pte_entry:		if set, called for each PTE (lowest-level) entry
> > + *			including empty ones, except if @install_pte is set.
> > + *			If @install_pte is set, @pte_entry is called only for
> > + *			existing PTEs.
> >   * @pte_hole:		if set, called for each hole at all levels,
> >   *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
> >   *			Any folded depths (where PTRS_PER_P?D is equal to 1)
> > - *			are skipped.
> > + *			are skipped. If @install_pte is specified, this will
> > + *			not trigger for any populated ranges.
> >   * @hugetlb_entry:	if set, called for each hugetlb entry. This hook
> >   *			function is called with the vma lock held, in order to
> >   *			protect against a concurrent freeing of the pte_t* or
> > @@ -51,6 +54,13 @@ enum page_walk_lock {
> >   * @pre_vma:            if set, called before starting walk on a non-null vma.
> >   * @post_vma:           if set, called after a walk on a non-null vma, provided
> >   *                      that @pre_vma and the vma walk succeeded.
> > + * @install_pte:        if set, missing page table entries are installed and
> > + *                      thus all levels are always walked in the specified
> > + *                      range. This callback is then invoked at the PTE level
> > + *                      (having split any THP pages prior), providing the PTE to
> > + *                      install. If allocations fail, the walk is aborted. This
> > + *                      operation is only available for userland memory. Not
> > + *                      usable for hugetlb ranges.
>
> Given that walk_page_range_novma() in particular has a bunch of call sites,
> it would not hurt to simply mention here that this is only for mm-internal
> use, without much other explanation.

We explicitly document this in multiple places. A user will very quickly
discover this is not available.

I will adjust this blurb if I need to do a respin.

>
> >   *
> >   * p?d_entry callbacks are called even if those levels are folded on a
> >   * particular architecture/configuration.
> > @@ -76,6 +86,8 @@ struct mm_walk_ops {
> >  	int (*pre_vma)(unsigned long start, unsigned long end,
> >  		       struct mm_walk *walk);
> >  	void (*post_vma)(struct mm_walk *walk);
> > +	int (*install_pte)(unsigned long addr, unsigned long next,
> > +			   pte_t *ptep, struct mm_walk *walk);
> >  	enum page_walk_lock walk_lock;
> >  };
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 508f7802dd2b..fb1fb0c984e4 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -12,6 +12,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/mm_inline.h>
> >  #include <linux/pagemap.h>
> > +#include <linux/pagewalk.h>
> >  #include <linux/rmap.h>
> >  #include <linux/swap.h>
> >  #include <linux/swapops.h>
> > @@ -1451,4 +1452,9 @@ static inline void accept_page(struct page *page)
> >  }
> >  #endif /* CONFIG_UNACCEPTED_MEMORY */
> >
> > +/* pagewalk.c */
> > +int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> > +		unsigned long end, const struct mm_walk_ops *ops,
> > +		void *private);
> > +
> >  #endif	/* __MM_INTERNAL_H */
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index 5f9f01532e67..f3cbad384344 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -3,9 +3,14 @@
> >  #include <linux/highmem.h>
> >  #include <linux/sched.h>
> >  #include <linux/hugetlb.h>
> > +#include <linux/mmu_context.h>
> >  #include <linux/swap.h>
> >  #include <linux/swapops.h>
> >
> > +#include <asm/tlbflush.h>
> > +
> > +#include "internal.h"
> > +
> >  /*
> >   * We want to know the real level where a entry is located ignoring any
> >   * folding of levels which may be happening. For example if p4d is folded then
> > @@ -29,9 +34,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
> >  	int err = 0;
> >
> >  	for (;;) {
> > -		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> > -		if (err)
> > -		       break;
> > +		if (ops->install_pte && pte_none(ptep_get(pte))) {
> > +			pte_t new_pte;
> > +
> > +			err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte,
> > +					       walk);
> > +			if (err)
> > +				break;
> > +
> > +			set_pte_at(walk->mm, addr, pte, new_pte);
> > +			/* Non-present before, so for arches that need it. */
> > +			if (!WARN_ON_ONCE(walk->no_vma))
> > +				update_mmu_cache(walk->vma, addr, pte);
> > +		} else {
> > +			err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> > +			if (err)
> > +				break;
> > +		}
> >  		if (addr >= end - PAGE_SIZE)
> >  			break;
> >  		addr += PAGE_SIZE;
> > @@ -89,11 +108,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >  again:
> >  		next = pmd_addr_end(addr, end);
> >  		if (pmd_none(*pmd)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pte_alloc(walk->mm, pmd);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >
> >  		walk->action = ACTION_SUBTREE;
> > @@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> >
> >  		if (walk->action == ACTION_AGAIN)
> >  			goto again;
> > -
> > -		/*
> > -		 * Check this here so we only break down trans_huge
> > -		 * pages when we _need_ to
> > -		 */
> > -		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
> > -		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pte_entry))
> > +		if (walk->action == ACTION_CONTINUE)
> >  			continue;
> > +		if (!ops->install_pte && !ops->pte_entry)
> > +			continue; /* Nothing to do. */
> > +		if (!ops->pte_entry && ops->install_pte &&
> > +		    pmd_present(*pmd) &&
> > +		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
> > +			continue; /* Avoid unnecessary split. */
> >
> >  		if (walk->vma)
> >  			split_huge_pmd(walk->vma, pmd, addr);
> > +		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
> > +			continue; /* Nothing to do. */
> >
> >  		err = walk_pte_range(pmd, addr, next, walk);
> >  		if (err)
> > @@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >   again:
> >  		next = pud_addr_end(addr, end);
> >  		if (pud_none(*pud)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pmd_alloc(walk->mm, pud, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >
> >  		walk->action = ACTION_SUBTREE;
> > @@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> >
> >  		if (walk->action == ACTION_AGAIN)
> >  			goto again;
> > -
> > -		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
> > -		    walk->action == ACTION_CONTINUE ||
> > -		    !(ops->pmd_entry || ops->pte_entry))
> > +		if (walk->action == ACTION_CONTINUE)
> >  			continue;
> > +		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
> > +			continue;  /* Nothing to do. */
> > +		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
> > +		    pud_present(*pud) &&
> > +		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
> > +			continue; /* Avoid unnecessary split. */
> >
> >  		if (walk->vma)
> >  			split_huge_pud(walk->vma, pud, addr);
> > +		else if (pud_leaf(*pud) || !pud_present(*pud))
> > +			continue; /* Nothing to do. */
> > +
> >  		if (pud_none(*pud))
> >  			goto again;
> >
> > @@ -196,18 +228,22 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
> >  	do {
> >  		next = p4d_addr_end(addr, end);
> >  		if (p4d_none_or_clear_bad(p4d)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __pud_alloc(walk->mm, p4d, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, depth, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >  		if (ops->p4d_entry) {
> >  			err = ops->p4d_entry(p4d, addr, next, walk);
> >  			if (err)
> >  				break;
> >  		}
> > -		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
> > +		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry ||
> > +		    ops->install_pte)
> >  			err = walk_pud_range(p4d, addr, next, walk);
> >  		if (err)
> >  			break;
> > @@ -231,18 +267,22 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
> >  	do {
> >  		next = pgd_addr_end(addr, end);
> >  		if (pgd_none_or_clear_bad(pgd)) {
> > -			if (ops->pte_hole)
> > +			if (ops->install_pte)
> > +				err = __p4d_alloc(walk->mm, pgd, addr);
> > +			else if (ops->pte_hole)
> >  				err = ops->pte_hole(addr, next, 0, walk);
> >  			if (err)
> >  				break;
> > -			continue;
> > +			if (!ops->install_pte)
> > +				continue;
> >  		}
> >  		if (ops->pgd_entry) {
> >  			err = ops->pgd_entry(pgd, addr, next, walk);
> >  			if (err)
> >  				break;
> >  		}
> > -		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry)
> > +		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry ||
> > +		    ops->pte_entry || ops->install_pte)
> >  			err = walk_p4d_range(pgd, addr, next, walk);
> >  		if (err)
> >  			break;
> > @@ -334,6 +374,11 @@ static int __walk_page_range(unsigned long start, unsigned long end,
> >  	int err = 0;
> >  	struct vm_area_struct *vma = walk->vma;
> >  	const struct mm_walk_ops *ops = walk->ops;
> > +	bool is_hugetlb = is_vm_hugetlb_page(vma);
> > +
> > +	/* We do not support hugetlb PTE installation. */
> > +	if (ops->install_pte && is_hugetlb)
> > +		return -EINVAL;
> >
> >  	if (ops->pre_vma) {
> >  		err = ops->pre_vma(start, end, walk);
> > @@ -341,7 +386,7 @@ static int __walk_page_range(unsigned long start, unsigned long end,
> >  			return err;
> >  	}
> >
> > -	if (is_vm_hugetlb_page(vma)) {
> > +	if (is_hugetlb) {
> >  		if (ops->hugetlb_entry)
> >  			err = walk_hugetlb_range(start, end, walk);
> >  	} else
> > @@ -380,47 +425,14 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> >  #endif
> >  }
> >
> > -/**
> > - * walk_page_range - walk page table with caller specific callbacks
> > - * @mm:		mm_struct representing the target process of page table walk
> > - * @start:	start address of the virtual address range
> > - * @end:	end address of the virtual address range
> > - * @ops:	operation to call during the walk
> > - * @private:	private data for callbacks' usage
> > - *
> > - * Recursively walk the page table tree of the process represented by @mm
> > - * within the virtual address range [@start, @end). During walking, we can do
> > - * some caller-specific works for each entry, by setting up pmd_entry(),
> > - * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
> > - * callbacks, the associated entries/pages are just ignored.
> > - * The return values of these callbacks are commonly defined like below:
> > - *
> > - *  - 0  : succeeded to handle the current entry, and if you don't reach the
> > - *         end address yet, continue to walk.
> > - *  - >0 : succeeded to handle the current entry, and return to the caller
> > - *         with caller specific value.
> > - *  - <0 : failed to handle the current entry, and return to the caller
> > - *         with error code.
> > - *
> > - * Before starting to walk page table, some callers want to check whether
> > - * they really want to walk over the current vma, typically by checking
> > - * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
> > - * purpose.
> > - *
> > - * If operations need to be staged before and committed after a vma is walked,
> > - * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
> > - * since it is intended to handle commit-type operations, can't return any
> > - * errors.
> > - *
> > - * struct mm_walk keeps current values of some common data like vma and pmd,
> > - * which are useful for the access from callbacks. If you want to pass some
> > - * caller-specific data to callbacks, @private should be helpful.
> > +/*
> > + * See the comment for walk_page_range(), this performs the heavy lifting of the
> > + * operation, only sets no restrictions on how the walk proceeds.
> >   *
> > - * Locking:
> > - *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
> > - *   because these function traverse vma list and/or access to vma's data.
> > + * We usually restrict the ability to install PTEs, but this functionality is
> > + * available to internal memory management code and provided in mm/internal.h.
> >   */
> > -int walk_page_range(struct mm_struct *mm, unsigned long start,
> > +int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> >  		unsigned long end, const struct mm_walk_ops *ops,
> >  		void *private)
> >  {
> > @@ -479,6 +491,80 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
> >  	return err;
> >  }
> >
> > +/*
> > + * Determine if the walk operations specified are permitted to be used for a
> > + * page table walk.
> > + *
> > + * This check is performed on all functions which are parameterised by walk
> > + * operations and exposed in include/linux/pagewalk.h.
> > + *
> > + * Internal memory management code can use the walk_page_range_mm() function to
> > + * be able to use all page walking operations.
> > + */
> > +static bool check_ops_valid(const struct mm_walk_ops *ops)
> > +{
> > +	/*
> > +	 * The installation of PTEs is solely under the control of memory
> > +	 * management logic and subject to many subtle locking, security and
> > +	 * cache considerations so we cannot permit other users to do so, and
> > +	 * certainly not for exported symbols.
> > +	 */
> > +	if (ops->install_pte)
> > +		return false;
> > +
> > +	return true;
>
> or "return !!(ops->install_pte);"
>
> > +}
>
> Alternatively one could consider defining "struct mm_walk_internal_ops",
> which would only be available in internal.h, but I guess there are good
> reasons to do it the way it is.

Yes.

>
> > +
> > +/**
> > + * walk_page_range - walk page table with caller specific callbacks
> > + * @mm:		mm_struct representing the target process of page table walk
> > + * @start:	start address of the virtual address range
> > + * @end:	end address of the virtual address range
> > + * @ops:	operation to call during the walk
> > + * @private:	private data for callbacks' usage
> > + *
> > + * Recursively walk the page table tree of the process represented by @mm
> > + * within the virtual address range [@start, @end). During walking, we can do
> > + * some caller-specific works for each entry, by setting up pmd_entry(),
> > + * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these
> > + * callbacks, the associated entries/pages are just ignored.
> > + * The return values of these callbacks are commonly defined like below:
> > + *
> > + *  - 0  : succeeded to handle the current entry, and if you don't reach the
> > + *         end address yet, continue to walk.
> > + *  - >0 : succeeded to handle the current entry, and return to the caller
> > + *         with caller specific value.
> > + *  - <0 : failed to handle the current entry, and return to the caller
> > + *         with error code.
> > + *
> > + * Before starting to walk page table, some callers want to check whether
> > + * they really want to walk over the current vma, typically by checking
> > + * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
> > + * purpose.
> > + *
> > + * If operations need to be staged before and committed after a vma is walked,
> > + * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
> > + * since it is intended to handle commit-type operations, can't return any
> > + * errors.
> > + *
> > + * struct mm_walk keeps current values of some common data like vma and pmd,
> > + * which are useful for the access from callbacks. If you want to pass some
> > + * caller-specific data to callbacks, @private should be helpful.
> > + *
> > + * Locking:
> > + *   Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock,
> > + *   because these function traverse vma list and/or access to vma's data.
> > + */
> > +int walk_page_range(struct mm_struct *mm, unsigned long start,
> > +		unsigned long end, const struct mm_walk_ops *ops,
> > +		void *private)
> > +{
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> > +
> > +	return walk_page_range_mm(mm, start, end, ops, private);
> > +}
> > +
> >  /**
> >   * walk_page_range_novma - walk a range of pagetables not backed by a vma
> >   * @mm:		mm_struct representing the target process of page table walk
> > @@ -494,7 +580,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
> >   * walking the kernel pages tables or page tables for firmware.
> >   *
> >   * Note: Be careful to walk the kernel pages tables, the caller may be need to
> > - * take other effective approache (mmap lock may be insufficient) to prevent
> > + * take other effective approaches (mmap lock may be insufficient) to prevent
> >   * the intermediate kernel page tables belonging to the specified address range
> >   * from being freed (e.g. memory hot-remove).
> >   */
> > @@ -513,6 +599,8 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
> >
> >  	if (start >= end || !walk.mm)
> >  		return -EINVAL;
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> >
> >  	/*
> >  	 * 1) For walking the user virtual address space:
> > @@ -556,6 +644,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
> >  		return -EINVAL;
> >  	if (start < vma->vm_start || end > vma->vm_end)
> >  		return -EINVAL;
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> >
> >  	process_mm_walk_lock(walk.mm, ops->walk_lock);
> >  	process_vma_walk_lock(vma, ops->walk_lock);
> > @@ -574,6 +664,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> >
> >  	if (!walk.mm)
> >  		return -EINVAL;
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> >
> >  	process_mm_walk_lock(walk.mm, ops->walk_lock);
> >  	process_vma_walk_lock(vma, ops->walk_lock);
> > @@ -623,6 +715,9 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
> >  	unsigned long start_addr, end_addr;
> >  	int err = 0;
> >
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
> > +
> >  	lockdep_assert_held(&mapping->i_mmap_rwsem);
> >  	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index,
> >  				  first_index + nr - 1) {
>
> So I took a random patch set in order to learn how to compile the Fedora Ark
> kernel out of an arbitrary upstream tree (mm in this case), thus making noise
> here.
>
> With this goal, which is mainly to be able to do such a thing once or twice
> per release cycle:
>
> Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
>
> BR, Jarkko

Thanks!

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2024-10-28 21:50 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-23 16:24 [PATCH v3 0/5] implement lightweight guard pages Lorenzo Stoakes
2024-10-23 16:24 ` [PATCH v3 1/5] mm: pagewalk: add the ability to install PTEs Lorenzo Stoakes
2024-10-23 23:04   ` Andrew Morton
2024-10-24  7:34     ` Lorenzo Stoakes
2024-10-24  7:45       ` David Hildenbrand
2024-10-24  8:07         ` Lorenzo Stoakes
2024-10-25 19:08           ` David Hildenbrand
2024-10-26  7:42             ` Lorenzo Stoakes
2024-10-25 18:13   ` Vlastimil Babka
2024-10-25 21:58     ` Lorenzo Stoakes
2024-10-28 20:29   ` Jarkko Sakkinen
2024-10-28 21:49     ` Lorenzo Stoakes
2024-10-23 16:24 ` [PATCH v3 2/5] mm: add PTE_MARKER_GUARD PTE marker Lorenzo Stoakes
2024-10-28 20:34   ` Jarkko Sakkinen
2024-10-23 16:24 ` [PATCH v3 3/5] mm: madvise: implement lightweight guard page mechanism Lorenzo Stoakes
2024-10-23 23:12   ` Andrew Morton
2024-10-24  7:25     ` Lorenzo Stoakes
2024-10-26  0:11       ` Andrew Morton
2024-10-26  7:40         ` Lorenzo Stoakes
2024-10-25 17:12   ` Lorenzo Stoakes
2024-10-25 21:56     ` Vlastimil Babka
2024-10-25 22:35       ` Lorenzo Stoakes
2024-10-28 12:40         ` Lorenzo Stoakes
2024-10-25 21:44   ` Vlastimil Babka
2024-10-25 21:49     ` Lorenzo Stoakes
2024-10-28 20:45   ` Jarkko Sakkinen
2024-10-23 16:24 ` [PATCH v3 4/5] tools: testing: update tools UAPI header for mman-common.h Lorenzo Stoakes
2024-10-23 16:24 ` [PATCH v3 5/5] selftests/mm: add self tests for guard page feature Lorenzo Stoakes
2024-10-28 20:32   ` Jarkko Sakkinen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).