From: Dev Jain <dev.jain@arm.com>
To: akpm@linux-foundation.org
Cc: ryan.roberts@arm.com, david@redhat.com, willy@infradead.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	catalin.marinas@arm.com, will@kernel.org,
	Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com,
	vbabka@suse.cz, jannh@google.com, anshuman.khandual@arm.com,
	peterx@redhat.com, joey.gouly@arm.com, ioworker0@gmail.com,
	baohua@kernel.org, kevin.brodsky@arm.com,
	quic_zhenhuah@quicinc.com, christophe.leroy@csgroup.eu,
	yangyicong@hisilicon.com, linux-arm-kernel@lists.infradead.org,
	hughd@google.com, yang@os.amperecomputing.com, ziy@nvidia.com,
	Dev Jain <dev.jain@arm.com>
Subject: [PATCH v4 3/4] mm: Optimize mprotect() by PTE-batching
Date: Sat, 28 Jun 2025 17:04:34 +0530
Message-ID: <20250628113435.46678-4-dev.jain@arm.com>
In-Reply-To: <20250628113435.46678-1-dev.jain@arm.com>

Use folio_pte_batch() to batch-process a large folio. Reuse the folio from
the prot_numa case if possible.

For all checks other than the PageAnonExclusive one, if a check holds true
for one pte in the batch, it is guaranteed to hold true for all other ptes
in the batch too; to preserve this for pte_needs_soft_dirty_wp(), we do not
pass FPB_IGNORE_SOFT_DIRTY. modify_prot_start_ptes() collects the dirty and
access bits across the batch, which lets us batch across pte_dirty(): this
is correct because the dirty bit on a PTE is really just an indication that
the folio got written to, so even if a particular PTE is not dirty (but some
other PTE in the batch is), the wp-fault optimization can still be made.
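
As an illustration, here is a minimal userspace model of that dirty/access
folding (illustrative only: struct model_pte and fold_batch() are made-up
names for this example, not kernel APIs; in the patch the folding is done
by modify_prot_start_ptes()):

#include <stdbool.h>
#include <stdio.h>

/* Illustrative per-PTE software state; not the real pte_t. */
struct model_pte {
	bool dirty;
	bool young;
};

/*
 * Fold the dirty and access bits of every PTE in the batch into one value:
 * if any PTE in the batch is dirty/young, the batch as a whole is treated
 * as dirty/young.
 */
static struct model_pte fold_batch(const struct model_pte *ptes, int nr)
{
	struct model_pte folded = ptes[0];

	for (int i = 1; i < nr; i++) {
		folded.dirty |= ptes[i].dirty;
		folded.young |= ptes[i].young;
	}
	return folded;
}

int main(void)
{
	/* Only the second PTE of the batch is dirty... */
	struct model_pte batch[4] = {
		{ .dirty = false, .young = true  },
		{ .dirty = true,  .young = false },
		{ .dirty = false, .young = false },
		{ .dirty = false, .young = true  },
	};
	struct model_pte folded = fold_batch(batch, 4);

	/*
	 * ...so the folded value is dirty, and for a VM_SHARED mapping the
	 * wp-fault optimization can be applied to all four PTEs at once.
	 */
	printf("batch dirty=%d young=%d\n", folded.dirty, folded.young);
	return 0;
}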

The crux now is how to batch around the PageAnonExclusive case, since that
condition must be checked for every single page. Therefore, within the large
folio batch, we carve out sub-batches of ptes mapping pages that share the
same PageAnonExclusive value, process one sub-batch, then determine and
process the next, and so on. Note that this does not add any overhead: if
the folio batch spans 512 ptes, the sub-batch processing still takes 512
iterations in total, the same as what we would have done before.
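
To make that sub-batch walk concrete, here is a minimal userspace sketch of
the idea (illustrative only: model_exclusive_run() and page_exclusive[] are
made-up stand-ins for anon_exclusive_batch() and the per-page
PageAnonExclusive() checks in the diff below; no kernel API is used):

#include <stdbool.h>
#include <stdio.h>

/*
 * Length of the leading run of entries sharing the same "exclusive" value,
 * mirroring what anon_exclusive_batch() computes for each sub-batch.
 */
static int model_exclusive_run(const bool *exclusive, int max_nr, bool *out)
{
	int nr = 1;

	*out = exclusive[0];
	while (nr < max_nr && exclusive[nr] == *out)
		nr++;
	return nr;
}

int main(void)
{
	/* One PageAnonExclusive flag per page of the folio batch. */
	bool page_exclusive[8] = { true, true, true, false, false,
				   true, true, true };
	int nr_ptes = 8, pgidx = 0, visited = 0;

	while (nr_ptes) {
		bool sub_exclusive;
		int sub_nr = model_exclusive_run(&page_exclusive[pgidx],
						 nr_ptes, &sub_exclusive);

		printf("sub-batch: %d pages, exclusive=%d\n",
		       sub_nr, sub_exclusive);
		pgidx += sub_nr;
		nr_ptes -= sub_nr;
		visited += sub_nr;
	}

	/* Every page is visited exactly once: visited == 8 here. */
	printf("total pages visited: %d\n", visited);
	return 0;
}

With this input the walk yields sub-batches of 3, 2 and 3 pages, and the
total number of per-page visits is still 8, i.e. the size of the batch.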

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mprotect.c | 143 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 117 insertions(+), 26 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 627b0d67cc4a..28c7ce7728ff 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -40,35 +40,47 @@
 
 #include "internal.h"
 
-bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
-			     pte_t pte)
-{
-	struct page *page;
+enum tristate {
+	TRI_FALSE = 0,
+	TRI_TRUE = 1,
+	TRI_MAYBE = -1,
+};
 
+/*
+ * Returns enum tristate indicating whether the pte can be changed to writable.
+ * If TRI_MAYBE is returned, then the folio is anonymous and the user must
+ * additionally check PageAnonExclusive() for every page in the desired range.
+ */
+static int maybe_change_pte_writable(struct vm_area_struct *vma,
+				     unsigned long addr, pte_t pte,
+				     struct folio *folio)
+{
 	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
-		return false;
+		return TRI_FALSE;
 
 	/* Don't touch entries that are not even readable. */
 	if (pte_protnone(pte))
-		return false;
+		return TRI_FALSE;
 
 	/* Do we need write faults for softdirty tracking? */
 	if (pte_needs_soft_dirty_wp(vma, pte))
-		return false;
+		return TRI_FALSE;
 
 	/* Do we need write faults for uffd-wp tracking? */
 	if (userfaultfd_pte_wp(vma, pte))
-		return false;
+		return TRI_FALSE;
 
 	if (!(vma->vm_flags & VM_SHARED)) {
 		/*
 		 * Writable MAP_PRIVATE mapping: We can only special-case on
 		 * exclusive anonymous pages, because we know that our
 		 * write-fault handler similarly would map them writable without
-		 * any additional checks while holding the PT lock.
+		 * any additional checks while holding the PT lock. So if the
+		 * folio is not anonymous, we know we cannot change pte to
+		 * writable. If it is anonymous then the caller must further
+		 * check that the page is AnonExclusive().
 		 */
-		page = vm_normal_page(vma, addr, pte);
-		return page && PageAnon(page) && PageAnonExclusive(page);
+		return (!folio || folio_test_anon(folio)) ? TRI_MAYBE : TRI_FALSE;
 	}
 
 	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
@@ -80,15 +92,61 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	 * FS was already notified and we can simply mark the PTE writable
 	 * just like the write-fault handler would do.
 	 */
-	return pte_dirty(pte);
+	return pte_dirty(pte) ? TRI_TRUE : TRI_FALSE;
+}
+
+/*
+ * Returns the number of pages within the folio, starting from the page
+ * indicated by pgidx and up to pgidx + max_nr, that have the same value of
+ * PageAnonExclusive(). Must only be called for anonymous folios. Value of
+ * PageAnonExclusive() is returned in *exclusive.
+ */
+static int anon_exclusive_batch(struct folio *folio, int pgidx, int max_nr,
+				bool *exclusive)
+{
+	struct page *page;
+	int nr = 1;
+
+	if (!folio) {
+		*exclusive = false;
+		return nr;
+	}
+
+	page = folio_page(folio, pgidx++);
+	*exclusive = PageAnonExclusive(page);
+	while (nr < max_nr) {
+		page = folio_page(folio, pgidx++);
+		if ((*exclusive) != PageAnonExclusive(page))
+			break;
+		nr++;
+	}
+
+	return nr;
+}
+
+bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte)
+{
+	struct page *page;
+	int ret;
+
+	ret = maybe_change_pte_writable(vma, addr, pte, NULL);
+	if (ret == TRI_MAYBE) {
+		page = vm_normal_page(vma, addr, pte);
+		ret = page && PageAnon(page) && PageAnonExclusive(page);
+	}
+
+	return ret;
 }
 
 static int mprotect_folio_pte_batch(struct folio *folio, unsigned long addr,
-		pte_t *ptep, pte_t pte, int max_nr_ptes)
+		pte_t *ptep, pte_t pte, int max_nr_ptes, fpb_t switch_off_flags)
 {
-	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+	fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+
+	flags &= ~switch_off_flags;
 
-	if (!folio || !folio_test_large(folio) || (max_nr_ptes == 1))
+	if (!folio || !folio_test_large(folio))
 		return 1;
 
 	return folio_pte_batch(folio, addr, ptep, pte, max_nr_ptes, flags,
@@ -154,7 +212,8 @@ static int prot_numa_skip_ptes(struct folio **foliop, struct vm_area_struct *vma
 	}
 
 skip_batch:
-	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte, max_nr_ptes);
+	nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
+					   max_nr_ptes, 0);
 out:
 	*foliop = folio;
 	return nr_ptes;
@@ -191,7 +250,10 @@ static long change_pte_range(struct mmu_gather *tlb,
 		if (pte_present(oldpte)) {
 			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
 			struct folio *folio = NULL;
-			pte_t ptent;
+			int sub_nr_ptes, pgidx = 0;
+			pte_t ptent, newpte;
+			bool sub_set_write;
+			int set_write;
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
@@ -206,6 +268,11 @@ static long change_pte_range(struct mmu_gather *tlb,
 					continue;
 			}
 
+			if (!folio)
+				folio = vm_normal_folio(vma, addr, oldpte);
+
+			nr_ptes = mprotect_folio_pte_batch(folio, addr, pte, oldpte,
+							   max_nr_ptes, FPB_IGNORE_SOFT_DIRTY);
 			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
 			ptent = pte_modify(oldpte, newprot);
 
@@ -227,15 +294,39 @@ static long change_pte_range(struct mmu_gather *tlb,
 			 * example, if a PTE is already dirty and no other
 			 * COW or special handling is required.
 			 */
-			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
-			    !pte_write(ptent) &&
-			    can_change_pte_writable(vma, addr, ptent))
-				ptent = pte_mkwrite(ptent, vma);
-
-			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
-			if (pte_needs_flush(oldpte, ptent))
-				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
-			pages++;
+			set_write = (cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
+				    !pte_write(ptent);
+			if (set_write)
+				set_write = maybe_change_pte_writable(vma, addr, ptent, folio);
+
+			while (nr_ptes) {
+				if (set_write == TRI_MAYBE) {
+					sub_nr_ptes = anon_exclusive_batch(folio,
+						pgidx, nr_ptes, &sub_set_write);
+				} else {
+					sub_nr_ptes = nr_ptes;
+					sub_set_write = (set_write == TRI_TRUE);
+				}
+
+				if (sub_set_write)
+					newpte = pte_mkwrite(ptent, vma);
+				else
+					newpte = ptent;
+
+				modify_prot_commit_ptes(vma, addr, pte, oldpte,
+							newpte, sub_nr_ptes);
+				if (pte_needs_flush(oldpte, newpte))
+					tlb_flush_pte_range(tlb, addr,
+						sub_nr_ptes * PAGE_SIZE);
+
+				addr += sub_nr_ptes * PAGE_SIZE;
+				pte += sub_nr_ptes;
+				oldpte = pte_advance_pfn(oldpte, sub_nr_ptes);
+				ptent = pte_advance_pfn(ptent, sub_nr_ptes);
+				nr_ptes -= sub_nr_ptes;
+				pages += sub_nr_ptes;
+				pgidx += sub_nr_ptes;
+			}
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 			pte_t newpte;
-- 
2.30.2


