From: Ankur Arora <ankur.a.arora@oracle.com>
To: David Hildenbrand <david@redhat.com>
Cc: Ankur Arora <ankur.a.arora@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	akpm@linux-foundation.org, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, mingo@redhat.com,
	mjguzik@gmail.com, luto@kernel.org, peterz@infradead.org,
	acme@kernel.org, namhyung@kernel.org, tglx@linutronix.de,
	willy@infradead.org, raghavendra.kt@amd.com,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com
Subject: Re: [PATCH v5 13/14] mm: memory: support clearing page-extents
Date: Wed, 16 Jul 2025 10:54:33 -0700
Message-ID: <878qkof2cm.fsf@oracle.com>
In-Reply-To: <213a4333-24de-4216-8d9a-b70ac52c4263@redhat.com>


David Hildenbrand <david@redhat.com> writes:

> On 16.07.25 05:19, Ankur Arora wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 10.07.25 02:59, Ankur Arora wrote:
>>>> folio_zero_user() is constrained to clear in a page-at-a-time
>>>> fashion because it supports CONFIG_HIGHMEM which means that kernel
>>>> mappings for pages in a folio are not guaranteed to be contiguous.
>>>> We don't have this problem when running under configurations with
>>>> CONFIG_CLEAR_PAGE_EXTENT (implies !CONFIG_HIGHMEM), so zero in
>>>> longer page-extents.
>>>> This is expected to be faster because the processor can now optimize
>>>> the clearing based on the knowledge of the extent.
>>>> However, clearing in larger chunks can have two other problems:
>>>>    - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>>>      (larger folios don't have any expectation of cache locality).
>>>>    - preemption latency when clearing large folios.
>>>> Handle the first by splitting the clearing in three parts: the
>>>> faulting page and its immediate locality, and the regions to its left
>>>> and right; the local neighbourhood is cleared last.
>>>> The second problem is relevant only when running under cooperative
>>>> preemption models. Limit the worst case preemption latency by clearing
>>>> in architecture specified ARCH_CLEAR_PAGE_EXTENT units.
>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>> ---
>>>>    mm/memory.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>>>    1 file changed, 85 insertions(+), 1 deletion(-)
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index b0cda5aab398..c52806270375 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -7034,6 +7034,7 @@ static inline int process_huge_page(
>>>>    	return 0;
>>>>    }
>>>>    +#ifndef CONFIG_CLEAR_PAGE_EXTENT
>>>>    static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>>>>    				unsigned int nr_pages)
>>>>    {
>>>> @@ -7058,7 +7059,10 @@ static int clear_subpage(unsigned long addr, int idx, void *arg)
>>>>    /**
>>>>     * folio_zero_user - Zero a folio which will be mapped to userspace.
>>>>     * @folio: The folio to zero.
>>>> - * @addr_hint: The address will be accessed or the base address if uncelar.
>>>> + * @addr_hint: The address accessed by the user or the base address.
>>>> + *
>>>> + * folio_zero_user() uses clear_gigantic_page() or process_huge_page() to
>>>> + * do page-at-a-time zeroing because it needs to handle CONFIG_HIGHMEM.
>>>>     */
>>>>    void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>>>    {
>>>> @@ -7070,6 +7074,86 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>>>    		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>>>    }
>>>>    +#else /* CONFIG_CLEAR_PAGE_EXTENT */
>>>> +
>>>> +static void clear_pages_resched(void *addr, int npages)
>>>> +{
>>>> +	int i, remaining;
>>>> +
>>>> +	if (preempt_model_preemptible()) {
>>>> +		clear_pages(addr, npages);
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	for (i = 0; i < npages/ARCH_CLEAR_PAGE_EXTENT; i++) {
>>>> +		clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
>>>> +			    ARCH_CLEAR_PAGE_EXTENT);
>>>> +		cond_resched();
>>>> +	}
>>>> +
>>>> +	remaining = npages % ARCH_CLEAR_PAGE_EXTENT;
>>>> +
>>>> +	if (remaining)
>>>> +		clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
>>>> +			    remaining);
>>>> +out:
>>>> +	cond_resched();
>>>> +}
>>>> +
>>>> +/*
>>>> + * folio_zero_user - Zero a folio which will be mapped to userspace.
>>>> + * @folio: The folio to zero.
>>>> + * @addr_hint: The address accessed by the user or the base address.
>>>> + *
>>>> + * Uses architectural support for clear_pages() to zero page extents
>>>> + * instead of clearing page-at-a-time.
>>>> + *
>>>> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
>>>> + * pages in the immediate locality of the faulting page, and its left, right
>>>> + * regions; the local neighbourhood cleared last in order to keep cache
>>>> + * lines of the target region hot.
>>>> + *
>>>> + * For larger folios we assume that there is no expectation of cache locality
>>>> + * and just do a straight zero.
>>>> + */
>>>> +void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>>> +{
>>>> +	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>>>> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>>>> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>>>> +	const int width = 2; /* number of pages cleared last on either side */
>>>> +	struct range r[3];
>>>> +	int i;
>>>> +
>>>> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>>>> +		clear_pages_resched(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
>>>> +	 * ensure it sticks around in the cache.
>>>> +	 */
>>>> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
>>>> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
>>>> +
>>>> +	/* Region to the left of the fault */
>>>> +	r[1] = DEFINE_RANGE(pg.start,
>>>> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
>>>> +
>>>> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
>>>> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
>>>> +			    pg.end);
>>>> +
>>>> +	for (i = 0; i <= 2; i++) {
>>>> +		int npages = range_len(&r[i]);
>>>> +
>>>> +		if (npages > 0)
>>>> +			clear_pages_resched(page_address(folio_page(folio, r[i].start)), npages);
>>>> +	}
>>>> +}
>>>> +#endif /* CONFIG_CLEAR_PAGE_EXTENT */
>>>> +
>>>>    static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>>>>    				   unsigned long addr_hint,
>>>>    				   struct vm_area_struct *vma,
>>>
>>> So, folio_zero_user() is only compiled with THP | HUGETLB already.
>>>
>>> What we should probably do is scrap the whole new kconfig option and
>>> do something like this in here:
>> So, in principle I don't disagree and unifying both of these is cleaner
>> than introducing a whole new option.
>
> Yes, after playing with the code, a new config option just for that is not
> what we want.
>
>> However, that still leaves this code having to contort around CONFIG_HIGHMEM,
>> which is probably even less frequently used than THP | HUGETLB.
>
> Not sure I understand your question correctly, but thp+hugetlb are compatible with
> 32bit and highmem.
>
> There are plans of removing highmem support, but that's a different story :)

Oh that would be a godsend! Whenever that happens.

> I think as long as these configs exist, we should just support them, although
> performance is a secondary concern.
>
>> Maybe we should get rid of ARCH_HAS_CLEAR_PAGES completely and have
>> everyone with !HIGHMEM either use a generic version of clear_pages(),
>> which loops over clear_page(), or an architectural override.
>> And then we can do a similar transformation with copy_pages() (and
>> copy_user_large_folio()).
>> At that point, process_huge_page() is used only for !HIGHMEM configs
>
> I assume you meant HIGHMEM

Oh yeah.

>> which likely have relatively small caches, so it probably ends up
>> over-engineered for them.
>
> I don't think we need to jump through hoops to optimize performance on
> highmem, yes.
>
>> The thing that gives me pause is that non-x86 architectures might perform
>> worse when they switch away from the left-right zeroing approach in
>> process_huge_page() to a generic clear_pages().
>
> Right. Or they perform better. Hard to know.
>
>> So, maybe allowing architectures to opt in by having to define
>> ARCH_HAS_CLEAR_PAGES would allow doing this in a more measured fashion.
>
> One tricky thing is dealing with architectures where clear_user_highpage()
> does cache management.

Oh yeah, I was forgetting that.

> So the more I think about it, I wonder if we really should just design it
> all around clear_user_highpages and clear_user_pages, and have only a
> single clearing algorithm.

Great. This is exactly what I was hoping to eventually get to.

> Essentially, something like the following, just that we need a generic
> clear_user_pages that iterates over clear_user_page.
>
> Then, x86_64 could simply implement clear_user_pages by routing it to your
> clear_pages, and define CLEAR_PAGES_RESCHED_NR (although I wonder if we can
> do better here).

Agreed.
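
Something like this on the x86 side, I think (rough, untested sketch; where
exactly it lands and the CLEAR_PAGES_RESCHED_NR value are placeholders, and
the signature just mirrors your clear_user_highpages() below):

	/* sketch: x86 (!HIGHMEM), route the multi-page clear to clear_pages() */
	static inline void clear_user_pages(struct page *page, unsigned long vaddr,
					    unsigned int nr_pages)
	{
		/* no cache management needed on x86; vaddr is unused */
		clear_pages(page_address(page), nr_pages);
	}
	#define clear_user_pages clear_user_pages

	/* pages cleared between cond_resched() calls; 8MB is just a strawman */
	#define CLEAR_PAGES_RESCHED_NR	(SZ_8M >> PAGE_SHIFT)

Whether a fixed CLEAR_PAGES_RESCHED_NR is good enough is, I guess, your "can
we do better" question; a per-arch constant seems like a reasonable start.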

So, essentially just have the lower-layer interfaces in place (generic
and arch-specific where needed):
 clear_pages()
 clear_user_pages()
 clear_user_highpages()

With the arch defining whichever of those it needs (and ARCH_CLEAR_PAGES_RESCHED_NR).
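
And the generic clear_user_pages() that you mention is still missing could
just loop over clear_user_page(), roughly (untested sketch; the signature
mirrors your clear_user_highpages() below):

	#ifndef clear_user_pages
	/*
	 * Generic fallback: clear page-at-a-time via clear_user_page().
	 * !HIGHMEM only, since page_address() needs a kernel mapping.
	 */
	static inline void clear_user_pages(struct page *page, unsigned long vaddr,
					    unsigned int nr_pages)
	{
		unsigned int i;

		for (i = 0; i < nr_pages; i++)
			clear_user_page(page_address(nth_page(page, i)),
					vaddr + i * PAGE_SIZE,
					nth_page(page, i));
	}
	#endif

with x86 (and anyone else with an optimized clear_pages()) overriding it.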

And a folio_zero_user() pretty much as in your diff below.

> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index 6234f316468c9..031e19c56765b 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -264,6 +264,14 @@ static inline void tag_clear_highpage(struct page *page)
>  #ifdef CONFIG_HIGHMEM
>  void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
>  		unsigned start2, unsigned end2);
> +static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
> +		unsigned int nr_pages)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < nr_pages; i++)
> +		clear_user_highpage(nth_page(page, i), vaddr + i * PAGE_SIZE);
> +}
>  #else
>  static inline void zero_user_segments(struct page *page,
>  		unsigned start1, unsigned end1,
> @@ -284,6 +292,7 @@ static inline void zero_user_segments(struct page *page,
>  	for (i = 0; i < compound_nr(page); i++)
>  		flush_dcache_page(page + i);
>  }
> +#define clear_user_highpages clear_user_pages
>  #endif
>
>  static inline void zero_user_segment(struct page *page,
> diff --git a/mm/memory.c b/mm/memory.c
> index 3dd6c57e6511e..8aebf6e0765d8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7009,40 +7009,92 @@ static inline int process_huge_page(
>  	return 0;
>  }
>
> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
> -				unsigned int nr_pages)
> +#ifndef CLEAR_PAGES_RESCHED_NR
> +#define CLEAR_PAGES_RESCHED_NR		1
> +#endif /* CLEAR_PAGES_RESCHED_NR */
> +
> +static void clear_user_highpages_resched(struct page *page, unsigned long addr,
> +		unsigned int nr_pages)
>  {
> -	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> -	int i;
> +	unsigned int i, remaining;
>
> -	might_sleep();
> -	for (i = 0; i < nr_pages; i++) {
> +	if (preempt_model_preemptible()) {
> +		clear_user_highpages(page, addr, nr_pages);
> +		goto out;
> +	}
> +
> +	for (i = 0; i < nr_pages / CLEAR_PAGES_RESCHED_NR; i++) {
> +		clear_user_highpages(nth_page(page, i * CLEAR_PAGES_RESCHED_NR),
> +				     addr + i * CLEAR_PAGES_RESCHED_NR * PAGE_SIZE,
> +				     CLEAR_PAGES_RESCHED_NR);
> -		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
>  		cond_resched();
>  	}
> -}
>
> -static int clear_subpage(unsigned long addr, int idx, void *arg)
> -{
> -	struct folio *folio = arg;
> +	remaining = nr_pages % CLEAR_PAGES_RESCHED_NR;
>
> -	clear_user_highpage(folio_page(folio, idx), addr);
> -	return 0;
> +	if (remaining)
> +		clear_user_highpages(nth_page(page, i * CLEAR_PAGES_RESCHED_NR),
> +				     addr + i * CLEAR_PAGES_RESCHED_NR * PAGE_SIZE,
> +				     remaining);
> +out:
> +	cond_resched();
>  }
>
> -/**
> +/*
>   * folio_zero_user - Zero a folio which will be mapped to userspace.
>   * @folio: The folio to zero.
> - * @addr_hint: The address will be accessed or the base address if uncelar.
> + * @addr_hint: The address accessed by the user or the base address.
> + *
> + * Uses architectural support for clear_pages() to zero page extents
> + * instead of clearing page-at-a-time.
> + *
> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
> + * pages in the immediate locality of the faulting page, and its left, right
> + * regions; the local neighbourhood cleared last in order to keep cache
> + * lines of the target region hot.
> + *
> + * For larger folios we assume that there is no expectation of cache locality
> + * and just do a straight zero.
>   */
>  void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>  {
> -	unsigned int nr_pages = folio_nr_pages(folio);
> +	const unsigned int nr_pages = folio_nr_pages(folio);
> +	const unsigned long addr = ALIGN_DOWN(addr_hint, nr_pages * PAGE_SIZE);
> +	const long fault_idx = (addr_hint - addr) / PAGE_SIZE;
> +	const struct range pg = DEFINE_RANGE(0, nr_pages - 1);
> +	const int width = 2; /* number of pages cleared last on either side */
> +	struct range r[3];
> +	int i;
> +
> +	if (unlikely(nr_pages >= MAX_ORDER_NR_PAGES)) {
> +		clear_user_highpages_resched(folio_page(folio, 0), addr, nr_pages);
> +		return;
> +	}
> +
> +	/*
> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
> +	 * ensure it sticks around in the cache.
> +	 */
> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
> +
> +	/* Region to the left of the fault */
> +	r[1] = DEFINE_RANGE(pg.start,
> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
> +
> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
> +			    pg.end);
> +
> +	for (i = 0; i <= 2; i++) {
> +		unsigned int cur_nr_pages = range_len(&r[i]);
> +		struct page *cur_page = folio_page(folio, r[i].start);
> +		unsigned long cur_addr = addr + folio_page_idx(folio, cur_page) * PAGE_SIZE;
> +
> +		if (cur_nr_pages > 0)
> +			clear_user_highpages_resched(cur_page, cur_addr, cur_nr_pages);
> +	}
>
> -	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
> -		clear_gigantic_page(folio, addr_hint, nr_pages);
> -	else
> -		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>  }
>
>  static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
> --
> 2.50.1
>
>
> On highmem we'd simply process individual pages, who cares.
>
> On !highmem, we'd use the optimized clear_user_pages -> clear_pages implementation
> if available. Otherwise, we clear individual pages.
>
> Yes, we'd lose the left-right pattern.
>
> If really important we could somehow let the architecture opt in and do the call
> to the existing process function.

Great. Alright let me work on this.

And, thanks for the very helpful comments.

--
ankur

Thread overview: 52+ messages
2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
2025-07-10  0:59 ` [PATCH v5 01/14] perf bench mem: Remove repetition around time measurement Ankur Arora
2025-07-15 20:04   ` Namhyung Kim
2025-07-10  0:59 ` [PATCH v5 02/14] perf bench mem: Defer type munging of size to float Ankur Arora
2025-07-15 20:05   ` Namhyung Kim
2025-07-16  2:17     ` Ankur Arora
2025-07-10  0:59 ` [PATCH v5 03/14] perf bench mem: Move mem op parameters into a structure Ankur Arora
2025-07-15 20:06   ` Namhyung Kim
2025-07-10  0:59 ` [PATCH v5 04/14] perf bench mem: Pull out init/fini logic Ankur Arora
2025-07-15 20:09   ` Namhyung Kim
2025-07-10  0:59 ` [PATCH v5 05/14] perf bench mem: Switch from zalloc() to mmap() Ankur Arora
2025-07-15 20:09   ` Namhyung Kim
2025-07-10  0:59 ` [PATCH v5 06/14] perf bench mem: Allow mapping of hugepages Ankur Arora
2025-07-15 20:12   ` Namhyung Kim
2025-07-16  2:32     ` Ankur Arora
2025-07-10  0:59 ` [PATCH v5 07/14] perf bench mem: Allow chunking on a memory region Ankur Arora
2025-07-15 20:17   ` Namhyung Kim
2025-07-16  2:34     ` Ankur Arora
2025-07-10  0:59 ` [PATCH v5 08/14] perf bench mem: Refactor mem_options Ankur Arora
2025-07-15 20:18   ` Namhyung Kim
2025-07-10  0:59 ` [PATCH v5 09/14] perf bench mem: Add mmap() workloads Ankur Arora
2025-07-15 20:20   ` Namhyung Kim
2025-07-16  2:40     ` Ankur Arora
2025-07-10  0:59 ` [PATCH v5 10/14] x86/mm: Simplify clear_page_* Ankur Arora
2025-07-11 11:47   ` David Hildenbrand
2025-07-11 17:26     ` Ankur Arora
2025-07-11 19:03       ` David Hildenbrand
2025-07-11 19:24         ` Ankur Arora
2025-07-11 19:27           ` David Hildenbrand
2025-07-10  0:59 ` [PATCH v5 11/14] x86/clear_page: Introduce clear_pages() Ankur Arora
2025-07-10  0:59 ` [PATCH v5 12/14] mm: add config option for clearing page-extents Ankur Arora
2025-07-10  7:58   ` Andrew Morton
2025-07-10 16:31     ` Ankur Arora
2025-07-11 11:39   ` David Hildenbrand
2025-07-11 17:25     ` Ankur Arora
2025-07-11 19:14       ` David Hildenbrand
2025-07-11 19:35         ` Ankur Arora
2025-07-11 11:40   ` David Hildenbrand
2025-07-11 17:32     ` Ankur Arora
2025-07-11 19:26       ` David Hildenbrand
2025-07-11 19:42         ` Ankur Arora
2025-07-14 20:35         ` Ankur Arora
2025-07-15 20:59           ` David Hildenbrand
2025-07-10  0:59 ` [PATCH v5 13/14] mm: memory: support " Ankur Arora
2025-07-11 11:44   ` David Hildenbrand
2025-07-11 13:27     ` Raghavendra K T
2025-07-11 17:39     ` Ankur Arora
2025-07-15 22:08   ` David Hildenbrand
2025-07-16  3:19     ` Ankur Arora
2025-07-16  8:03       ` David Hildenbrand
2025-07-16 17:54         ` Ankur Arora [this message]
2025-07-10  0:59 ` [PATCH v5 14/14] x86/clear_pages: Support clearing of page-extents Ankur Arora
