From: David Hildenbrand <david@redhat.com>
To: Ankur Arora <ankur.a.arora@oracle.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org
Cc: akpm@linux-foundation.org, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com, mingo@redhat.com,
mjguzik@gmail.com, luto@kernel.org, peterz@infradead.org,
acme@kernel.org, namhyung@kernel.org, tglx@linutronix.de,
willy@infradead.org, raghavendra.kt@amd.com,
boris.ostrovsky@oracle.com, konrad.wilk@oracle.com
Subject: Re: [PATCH v5 13/14] mm: memory: support clearing page-extents
Date: Wed, 16 Jul 2025 00:08:57 +0200 [thread overview]
Message-ID: <d6413d17-c530-4553-9eca-dec8dce37e7e@redhat.com> (raw)
In-Reply-To: <20250710005926.1159009-14-ankur.a.arora@oracle.com>
On 10.07.25 02:59, Ankur Arora wrote:
> folio_zero_user() is constrained to clear in a page-at-a-time
> fashion because it supports CONFIG_HIGHMEM which means that kernel
> mappings for pages in a folio are not guaranteed to be contiguous.
>
> We don't have this problem when running under configurations with
> CONFIG_CLEAR_PAGE_EXTENT (implies !CONFIG_HIGHMEM), so zero in
> longer page-extents.
> This is expected to be faster because the processor can now optimize
> the clearing based on the knowledge of the extent.
>
> However, clearing in larger chunks can have two other problems:
>
> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
> (larger folios don't have any expectation of cache locality).
>
> - preemption latency when clearing large folios.
>
> Handle the first by splitting the clearing in three parts: the
> faulting page and its immediate locality, its left and right
> regions; the local neighbourhood is cleared last.
>
> The second problem is relevant only when running under cooperative
> preemption models. Limit the worst case preemption latency by clearing
> in architecture specified ARCH_CLEAR_PAGE_EXTENT units.
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> mm/memory.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 85 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b0cda5aab398..c52806270375 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7034,6 +7034,7 @@ static inline int process_huge_page(
> return 0;
> }
>
> +#ifndef CONFIG_CLEAR_PAGE_EXTENT
> static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
> unsigned int nr_pages)
> {
> @@ -7058,7 +7059,10 @@ static int clear_subpage(unsigned long addr, int idx, void *arg)
> /**
> * folio_zero_user - Zero a folio which will be mapped to userspace.
> * @folio: The folio to zero.
> - * @addr_hint: The address will be accessed or the base address if uncelar.
> + * @addr_hint: The address accessed by the user or the base address.
> + *
> + * folio_zero_user() uses clear_gigantic_page() or process_huge_page() to
> + * do page-at-a-time zeroing because it needs to handle CONFIG_HIGHMEM.
> */
> void folio_zero_user(struct folio *folio, unsigned long addr_hint)
> {
> @@ -7070,6 +7074,86 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
> process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
> }
>
> +#else /* CONFIG_CLEAR_PAGE_EXTENT */
> +
> +static void clear_pages_resched(void *addr, int npages)
> +{
> + int i, remaining;
> +
> + if (preempt_model_preemptible()) {
> + clear_pages(addr, npages);
> + goto out;
> + }
> +
> + for (i = 0; i < npages/ARCH_CLEAR_PAGE_EXTENT; i++) {
> + clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
> + ARCH_CLEAR_PAGE_EXTENT);
> + cond_resched();
> + }
> +
> + remaining = npages % ARCH_CLEAR_PAGE_EXTENT;
> +
> + if (remaining)
> + clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SHIFT,
> + remaining);
> +out:
> + cond_resched();
> +}
> +
> +/*
> + * folio_zero_user - Zero a folio which will be mapped to userspace.
> + * @folio: The folio to zero.
> + * @addr_hint: The address accessed by the user or the base address.
> + *
> + * Uses architectural support for clear_pages() to zero page extents
> + * instead of clearing page-at-a-time.
> + *
> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
> + * pages in the immediate locality of the faulting page, and its left, right
> + * regions; the local neighbourhood cleared last in order to keep cache
> + * lines of the target region hot.
> + *
> + * For larger folios we assume that there is no expectation of cache locality
> + * and just do a straight zero.
> + */
> +void folio_zero_user(struct folio *folio, unsigned long addr_hint)
> +{
> + unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> + const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
> + const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
> + const int width = 2; /* number of pages cleared last on either side */
> + struct range r[3];
> + int i;
> +
> + if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
> + clear_pages_resched(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
> + return;
> + }
> +
> + /*
> + * Faulting page and its immediate neighbourhood. Cleared at the end to
> + * ensure it sticks around in the cache.
> + */
> + r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
> + clamp_t(s64, fault_idx + width, pg.start, pg.end));
> +
> + /* Region to the left of the fault */
> + r[1] = DEFINE_RANGE(pg.start,
> + clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
> +
> + /* Region to the right of the fault: always valid for the common fault_idx=0 case. */
> + r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
> + pg.end);
> +
> + for (i = 0; i <= 2; i++) {
> + int npages = range_len(&r[i]);
> +
> + if (npages > 0)
> + clear_pages_resched(page_address(folio_page(folio, r[i].start)), npages);
> + }
> +}
> +#endif /* CONFIG_CLEAR_PAGE_EXTENT */
> +
> static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
> unsigned long addr_hint,
> struct vm_area_struct *vma,
So, folio_zero_user() is only compiled in when THP or HUGETLB is enabled already.
What we should probably do is scrap the whole new kconfig option and
do something like this in here:
diff --git a/mm/memory.c b/mm/memory.c
index 3dd6c57e6511e..64b6bd3e7657a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7009,19 +7009,53 @@ static inline int process_huge_page(
return 0;
}
-static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
- unsigned int nr_pages)
+#ifdef CONFIG_ARCH_HAS_CLEAR_PAGES
+static void clear_user_highpages_resched(struct page *page,
+ unsigned int nr_pages, unsigned long addr)
+{
+ void *kaddr = page_address(page);
+ int i, remaining;
+
+ /*
+ * CONFIG_ARCH_HAS_CLEAR_PAGES is not expected to be set on systems
+ * with HIGHMEM, so we can safely use clear_pages().
+ */
+ BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHMEM));
+
+ if (preempt_model_preemptible()) {
+ clear_pages(kaddr, nr_pages);
+ goto out;
+ }
+
+ for (i = 0; i < nr_pages / ARCH_CLEAR_PAGE_EXTENT; i++) {
+ clear_pages(kaddr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
+ ARCH_CLEAR_PAGE_EXTENT);
+ cond_resched();
+ }
+
+ remaining = nr_pages % ARCH_CLEAR_PAGE_EXTENT;
+
+ if (remaining)
+ clear_pages(kaddr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
+ remaining);
+out:
+ cond_resched();
+}
+#else
+static void clear_user_highpages_resched(struct page *page,
+ unsigned int nr_pages, unsigned long addr)
{
- unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
int i;
might_sleep();
for (i = 0; i < nr_pages; i++) {
cond_resched();
- clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
+ clear_user_highpage(nth_page(page, i), addr + i * PAGE_SIZE);
}
}
+#endif /* CONFIG_ARCH_HAS_CLEAR_PAGES */
+
static int clear_subpage(unsigned long addr, int idx, void *arg)
{
struct folio *folio = arg;
@@ -7030,19 +7064,76 @@ static int clear_subpage(unsigned long addr, int idx, void *arg)
return 0;
}
-/**
+static void folio_zero_user_huge(struct folio *folio, unsigned long addr_hint)
+{
+ const unsigned int nr_pages = folio_nr_pages(folio);
+ const unsigned long addr = ALIGN_DOWN(addr_hint, nr_pages * PAGE_SIZE);
+ const long fault_idx = (addr_hint - addr) / PAGE_SIZE;
+ const struct range pg = DEFINE_RANGE(0, nr_pages - 1);
+ const int width = 2; /* number of pages cleared last on either side */
+ struct range r[3];
+ int i;
+
+ /*
+ * Without an optimized clear_user_highpages_resched(), we'll perform
+ * some extra magic dance around the faulting address.
+ */
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_CLEAR_PAGES)) {
+ process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+ return;
+ }
+
+ /*
+ * Faulting page and its immediate neighbourhood. Cleared at the end to
+ * ensure it sticks around in the cache.
+ */
+ r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+ clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+ /* Region to the left of the fault */
+ r[1] = DEFINE_RANGE(pg.start,
+ clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+ /* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+ r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+ pg.end);
+
+ for (i = 0; i <= 2; i++) {
+ unsigned int cur_nr_pages = range_len(&r[i]);
+ struct page *cur_page = folio_page(folio, r[i].start);
+ unsigned long cur_addr = addr + folio_page_idx(folio, cur_page) * PAGE_SIZE;
+
+ if (cur_nr_pages > 0)
+ clear_user_highpages_resched(cur_page, cur_nr_pages, cur_addr);
+ }
+}
+
+/*
* folio_zero_user - Zero a folio which will be mapped to userspace.
* @folio: The folio to zero.
- * @addr_hint: The address will be accessed or the base address if uncelar.
+ * @addr_hint: The address accessed by the user or the base address.
+ *
+ * Uses architectural support for clear_pages() to zero page extents
+ * instead of clearing page-at-a-time.
+ *
+ * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
+ * pages in the immediate locality of the faulting page, and its left, right
+ * regions; the local neighbourhood cleared last in order to keep cache
+ * lines of the target region hot.
+ *
+ * For larger folios we assume that there is no expectation of cache locality
+ * and just do a straight zero.
*/
void folio_zero_user(struct folio *folio, unsigned long addr_hint)
{
- unsigned int nr_pages = folio_nr_pages(folio);
+ const unsigned int nr_pages = folio_nr_pages(folio);
+ const unsigned long addr = ALIGN_DOWN(addr_hint, nr_pages * PAGE_SIZE);
- if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
- clear_gigantic_page(folio, addr_hint, nr_pages);
- else
- process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+ if (unlikely(nr_pages >= MAX_ORDER_NR_PAGES)) {
+ clear_user_highpages_resched(folio_page(folio, 0), nr_pages, addr);
+ return;
+ }
+ folio_zero_user_huge(folio, addr_hint);
}
static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
--
2.50.1
Note that this is probably completely broken in various ways; it's just meant
to give you an idea.
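FWIW, the only thing the generic code above needs from the architecture is
roughly the following -- so selecting ARCH_HAS_CLEAR_PAGES and providing these
two bits should be enough, without any separate user-visible option. To be
clear, this is just me sketching the contract: the names mirror your patches
11 and 14, and the extent value and the memset() stand-in are made up.

/*
 * Hypothetical arch side (x86), only illustrating the contract the generic
 * code relies on -- not what the patches actually do.
 */

/*
 * Largest extent cleared in one go before we cond_resched() under the
 * cooperative preemption models; the value here is just an example.
 */
#define ARCH_CLEAR_PAGE_EXTENT	(SZ_8M >> PAGE_SHIFT)

static inline void clear_pages(void *addr, unsigned int npages)
{
	/*
	 * On x86 this would be a single string operation (e.g. rep stosb)
	 * over the whole extent; memset() is only a placeholder here.
	 */
	memset(addr, 0, (unsigned long)npages << PAGE_SHIFT);
}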
*maybe* we could change clear_user_highpages_resched to something like
folio_zero_user_range(), consuming a folio + idx instead of a page. That might
or might not be better here.
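Something like this, perhaps (completely untested, just to sketch the idea;
the ARCH_CLEAR_PAGE_EXTENT / cond_resched() chunking from above would then
live in here, and the name/signature are my guess, not the series):

static void folio_zero_user_range(struct folio *folio, unsigned long idx,
				  unsigned int nr_pages, unsigned long addr)
{
	/*
	 * A !ARCH_HAS_CLEAR_PAGES variant would keep using addr for
	 * clear_user_highpage(); here we only need the kernel mapping,
	 * which is contiguous without HIGHMEM.
	 */
	clear_pages(page_address(folio_page(folio, idx)), nr_pages);
}

with the loop in folio_zero_user_huge() then doing

	folio_zero_user_range(folio, r[i].start, cur_nr_pages,
			      addr + r[i].start * PAGE_SIZE);

That would keep the page -> kernel address translation in one place instead
of spreading it over the callers.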
--
Cheers,
David / dhildenb