linux-mm.kvack.org archive mirror
* [PATCH v8 0/7] mm: folio_zero_user: clear contiguous pages
@ 2025-10-27 20:21 Ankur Arora
  2025-10-27 20:21 ` [PATCH v8 1/7] treewide: provide a generic clear_user_page() variant Ankur Arora
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Ankur Arora @ 2025-10-27 20:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

This series adds clearing of contiguous page ranges for hugepages,
improving on the current page-at-a-time approach in two ways:

 - amortizes the per-page setup cost over a larger extent
 - when using string instructions, exposes the real region size
   to the processor.

A processor could use knowledge of the extent to optimize the
clearing. AMD Zen uarchs, for example, elide allocation of
cachelines for regions larger than the L3 size.

Demand faulting a 64GB region shows performance improvements:

 $ perf bench mem map -p $pg-sz -f demand -s 64GB -l 5

                       baseline              +series             change

                  (GB/s  +- %stdev)     (GB/s  +- %stdev)

   pg-sz=2MB       12.92  +- 2.55%        17.03  +-  0.70%       + 31.8%	preempt=*

   pg-sz=1GB       17.14  +- 2.27%        18.04  +-  1.05% [#]   +  5.2%	preempt=none|voluntary
   pg-sz=1GB       17.26  +- 1.24%        42.17  +-  4.21%       +144.3%	preempt=full|lazy

[#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, which is higher than the maximum extent used on x86
(ARCH_CONTIG_PAGE_NR=8MB), so preempt=none|voluntary sees no improvement
with pg-sz=1GB.

The anon-w-seq test in the vm-scalability benchmark, however, does show
worse performance with utime increasing by ~9%:

                         stime                  utime

  baseline         1654.63 ( +- 3.84% )     811.00 ( +- 3.84% )
  +series          1630.32 ( +- 2.73% )     886.37 ( +- 5.19% )

In part this is because anon-w-seq runs with 384 processes zeroing
anonymously mapped memory which they then access sequentially. As
such, this is likely an uncommon pattern: memory bandwidth is
saturated while the workload is also cache limited, because the
entire region is accessed.

Raghavendra also tested a previous version of the series on AMD Genoa [1].

Changelog:

v8:
 - make clear_user_highpages(), clear_user_pages() and clear_pages()
   more robust across architectures. (Thanks David!)
 - split up folio_zero_user() changes into ones for clearing contiguous
   regions and those for maintaining temporal locality since they have
   different performance profiles (Suggested by Andrew Morton.)
 - added Raghavendra's Reviewed-by, Tested-by.
 - get rid of nth_page()
 - perf related patches have been pulled already. Remove them.

v7:
 - interface cleanups, comments for clear_user_highpages(), clear_user_pages(),
   clear_pages().
 - fixed build errors flagged by kernel test robot
 (https://lore.kernel.org/lkml/20250917152418.4077386-1-ankur.a.arora@oracle.com/)

v6:
 - perf bench mem: update man pages and other cleanups (Namhyung Kim)
 - unify folio_zero_user() for HIGHMEM, !HIGHMEM options instead of
   working through a new config option (David Hildenbrand).
   - cleanups and simplification around that.
 (https://lore.kernel.org/lkml/20250902080816.3715913-1-ankur.a.arora@oracle.com/)

v5:
 - move the non HIGHMEM implementation of folio_zero_user() from x86
   to common code (Dave Hansen)
 - Minor naming cleanups, commit messages etc
 (https://lore.kernel.org/lkml/20250710005926.1159009-1-ankur.a.arora@oracle.com/)

v4:
 - adds perf bench workloads to exercise mmap() populate/demand-fault (Ingo)
 - inline stosb etc (PeterZ)
 - handle cooperative preemption models (Ingo)
 - interface and other cleanups all over (Ingo)
 (https://lore.kernel.org/lkml/20250616052223.723982-1-ankur.a.arora@oracle.com/)

v3:
 - get rid of preemption dependency (TIF_ALLOW_RESCHED); this version
   was limited to preempt=full|lazy.
 - override folio_zero_user() (Linus)
 (https://lore.kernel.org/lkml/20250414034607.762653-1-ankur.a.arora@oracle.com/)

v2:
 - addressed review comments from peterz, tglx.
 - Removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
 - General code cleanup
 (https://lore.kernel.org/lkml/20230830184958.2333078-1-ankur.a.arora@oracle.com/)

Comments appreciated!

Also at:
  github.com/terminus/linux clear-pages.v7

[1] https://lore.kernel.org/lkml/fffd4dad-2cb9-4bc9-8a80-a70be687fd54@amd.com/

Ankur Arora (6):
  mm: introduce clear_pages() and clear_user_pages()
  mm/highmem: introduce clear_user_highpages()
  x86/mm: Simplify clear_page_*
  x86/clear_page: Introduce clear_pages()
  mm, folio_zero_user: support clearing page ranges
  mm: folio_zero_user: cache neighbouring pages

David Hildenbrand (1):
  treewide: provide a generic clear_user_page() variant

 arch/alpha/include/asm/page.h      |  1 -
 arch/arc/include/asm/page.h        |  2 +
 arch/arm/include/asm/page-nommu.h  |  1 -
 arch/arm64/include/asm/page.h      |  1 -
 arch/csky/abiv1/inc/abi/page.h     |  1 +
 arch/csky/abiv2/inc/abi/page.h     |  7 ---
 arch/hexagon/include/asm/page.h    |  1 -
 arch/loongarch/include/asm/page.h  |  1 -
 arch/m68k/include/asm/page_mm.h    |  1 +
 arch/m68k/include/asm/page_no.h    |  1 -
 arch/microblaze/include/asm/page.h |  1 -
 arch/mips/include/asm/page.h       |  1 +
 arch/nios2/include/asm/page.h      |  1 +
 arch/openrisc/include/asm/page.h   |  1 -
 arch/parisc/include/asm/page.h     |  1 -
 arch/powerpc/include/asm/page.h    |  1 +
 arch/riscv/include/asm/page.h      |  1 -
 arch/s390/include/asm/page.h       |  1 -
 arch/sparc/include/asm/page_32.h   |  2 +
 arch/sparc/include/asm/page_64.h   |  1 +
 arch/um/include/asm/page.h         |  1 -
 arch/x86/include/asm/page.h        |  6 ---
 arch/x86/include/asm/page_32.h     |  6 +++
 arch/x86/include/asm/page_64.h     | 64 ++++++++++++++++++-----
 arch/x86/lib/clear_page_64.S       | 39 +++-----------
 arch/xtensa/include/asm/page.h     |  1 -
 include/linux/highmem.h            | 29 +++++++++++
 include/linux/mm.h                 | 69 +++++++++++++++++++++++++
 mm/memory.c                        | 82 ++++++++++++++++++++++--------
 mm/util.c                          | 13 +++++
 30 files changed, 247 insertions(+), 91 deletions(-)

-- 
2.43.5



* Re: [PATCH v8 0/7] mm: folio_zero_user: clear contiguous pages
@ 2025-10-28 17:15 Ankur Arora
  0 siblings, 0 replies; 28+ messages in thread
From: Ankur Arora @ 2025-10-28 17:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, david, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk

References: <20251027202109.678022-1-ankur.a.arora@oracle.com>
 <20251027143309.4331a65f38f05ea95d9e46ad@linux-foundation.org>
User-agent: mu4e 1.4.10; emacs 27.2
In-reply-to: <20251027143309.4331a65f38f05ea95d9e46ad@linux-foundation.org>

Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 27 Oct 2025 13:21:02 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> This series adds clearing of contiguous page ranges for hugepages,
>> improving on the current page-at-a-time approach in two ways:
>>
>>  - amortizes the per-page setup cost over a larger extent
>>  - when using string instructions, exposes the real region size
>>    to the processor.
>>
>> A processor could use knowledge of the extent to optimize the
>> clearing. AMD Zen uarchs, for example, elide allocation of
>> cachelines for regions larger than the L3 size.
>>
>> Demand faulting a 64GB region shows performance improvements:
>>
>>  $ perf bench mem map -p $pg-sz -f demand -s 64GB -l 5
>>
>>                        baseline              +series             change
>>
>>                   (GB/s  +- %stdev)     (GB/s  +- %stdev)
>>
>>    pg-sz=2MB       12.92  +- 2.55%        17.03  +-  0.70%       + 31.8%	preempt=*
>>
>>    pg-sz=1GB       17.14  +- 2.27%        18.04  +-  1.05% [#]   +  5.2%	preempt=none|voluntary
>>    pg-sz=1GB       17.26  +- 1.24%        42.17  +-  4.21%       +144.3%	preempt=full|lazy
>>
>> [#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
>> allocation, which is higher than the maximum extent used on x86
>> (ARCH_CONTIG_PAGE_NR=8MB), so preempt=none|voluntary sees no improvement
>> with pg-sz=1GB.
>
> I wasn't understanding this preemption thing at all, but then I saw this
> in the v4 series changelogging:
>
> : [#] Only with preempt=full|lazy because cooperatively preempted models
> : need regular invocations of cond_resched(). This limits the extent
> : sizes that can be cleared as a unit.
>
> Please put this back in!!

/me facepalms. Sorry you had to go that far back.
Yeah, that doesn't make any kind of sense standalone. Will fix.

> It's possible that we're being excessively aggressive with those
> cond_resched()s.  Have you investigated tuning their frequency so we
> can use larger extent sizes with these preemption models?

folio_zero_user() does a small part of that: for 2MB pages the clearing
is split in three parts with an intervening cond_resched() for each.

This is of course much simpler than the process_huge_page() approach where
we do a left right dance around the faulting page.

In [a] I had implemented a version of process_huge_page() with larger
extent sizes that narrowed as we got closer to the faulting page (x86
performance was similar to the current series; see [b]).

In hindsight, that felt too elaborate and probably unnecessary on most
modern systems, which have reasonably large caches. Where it might
help is on more cache-constrained systems where spatial locality
really does matter.

So, my idea was to start with a simple version, get some testing and
then fill in the gaps instead of starting with something like [a].


[a] https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@oracle.com/#r
[b] https://lore.kernel.org/lkml/20220606202109.1306034-1-ankur.a.arora@oracle.com/


>> The anon-w-seq test in the vm-scalability benchmark, however, does show
>> worse performance with utime increasing by ~9%:
>>
>>                          stime                  utime
>>
>>   baseline         1654.63 ( +- 3.84% )     811.00 ( +- 3.84% )
>>   +series          1630.32 ( +- 2.73% )     886.37 ( +- 5.19% )
>>
>> In part this is because anon-w-seq runs with 384 processes zeroing
>> anonymously mapped memory which they then access sequentially. As
>> such, this is likely an uncommon pattern: memory bandwidth is
>> saturated while the workload is also cache limited, because the
>> entire region is accessed.
>>
>> Raghavendra also tested a previous version of the series on AMD Genoa [1].
>
> I suggest you paste Raghavendra's results into this [0/N] - it's
> important material.

Will do.

>>
>> ...
>>
>>  arch/alpha/include/asm/page.h      |  1 -
>>  arch/arc/include/asm/page.h        |  2 +
>>  arch/arm/include/asm/page-nommu.h  |  1 -
>>  arch/arm64/include/asm/page.h      |  1 -
>>  arch/csky/abiv1/inc/abi/page.h     |  1 +
>>  arch/csky/abiv2/inc/abi/page.h     |  7 ---
>>  arch/hexagon/include/asm/page.h    |  1 -
>>  arch/loongarch/include/asm/page.h  |  1 -
>>  arch/m68k/include/asm/page_mm.h    |  1 +
>>  arch/m68k/include/asm/page_no.h    |  1 -
>>  arch/microblaze/include/asm/page.h |  1 -
>>  arch/mips/include/asm/page.h       |  1 +
>>  arch/nios2/include/asm/page.h      |  1 +
>>  arch/openrisc/include/asm/page.h   |  1 -
>>  arch/parisc/include/asm/page.h     |  1 -
>>  arch/powerpc/include/asm/page.h    |  1 +
>>  arch/riscv/include/asm/page.h      |  1 -
>>  arch/s390/include/asm/page.h       |  1 -
>>  arch/sparc/include/asm/page_32.h   |  2 +
>>  arch/sparc/include/asm/page_64.h   |  1 +
>>  arch/um/include/asm/page.h         |  1 -
>>  arch/x86/include/asm/page.h        |  6 ---
>>  arch/x86/include/asm/page_32.h     |  6 +++
>>  arch/x86/include/asm/page_64.h     | 64 ++++++++++++++++++-----
>>  arch/x86/lib/clear_page_64.S       | 39 +++-----------
>>  arch/xtensa/include/asm/page.h     |  1 -
>>  include/linux/highmem.h            | 29 +++++++++++
>>  include/linux/mm.h                 | 69 +++++++++++++++++++++++++
>>  mm/memory.c                        | 82 ++++++++++++++++++++++--------
>>  mm/util.c                          | 13 +++++
>>  30 files changed, 247 insertions(+), 91 deletions(-)
>
> I guess this is an mm.git thing, with x86 acks (please).

Ack that.

> The documented review activity is rather thin at this time so I'll sit
> this out for a while.  Please ping me next week and we can reassess,

Will do. And, thanks for the quick look!

--
ankur
Date: Tue, 28 Oct 2025 10:15:38 -0700
Message-ID: <87zf9bq75x.fsf@oracle.com>



end of thread, other threads:[~2025-11-11  6:25 UTC | newest]

Thread overview: 28+ messages
2025-10-27 20:21 [PATCH v8 0/7] mm: folio_zero_user: clear contiguous pages Ankur Arora
2025-10-27 20:21 ` [PATCH v8 1/7] treewide: provide a generic clear_user_page() variant Ankur Arora
2025-10-27 20:21 ` [PATCH v8 2/7] mm: introduce clear_pages() and clear_user_pages() Ankur Arora
2025-11-07  8:47   ` David Hildenbrand (Red Hat)
2025-10-27 20:21 ` [PATCH v8 3/7] mm/highmem: introduce clear_user_highpages() Ankur Arora
2025-11-07  8:48   ` David Hildenbrand (Red Hat)
2025-11-10  7:20     ` Ankur Arora
2025-10-27 20:21 ` [PATCH v8 4/7] x86/mm: Simplify clear_page_* Ankur Arora
2025-10-28 13:36   ` Borislav Petkov
2025-10-29 23:26     ` Ankur Arora
2025-10-30  0:17       ` Borislav Petkov
2025-10-30  5:21         ` Ankur Arora
2025-10-27 20:21 ` [PATCH v8 5/7] x86/clear_page: Introduce clear_pages() Ankur Arora
2025-10-28 13:56   ` Borislav Petkov
2025-10-28 18:51     ` Ankur Arora
2025-10-29 22:57       ` Borislav Petkov
2025-10-29 23:31         ` Ankur Arora
2025-10-27 20:21 ` [PATCH v8 6/7] mm, folio_zero_user: support clearing page ranges Ankur Arora
2025-11-07  8:59   ` David Hildenbrand (Red Hat)
2025-11-10  7:20     ` Ankur Arora
2025-11-10  8:57       ` David Hildenbrand (Red Hat)
2025-11-11  6:24         ` Ankur Arora
2025-10-27 20:21 ` [PATCH v8 7/7] mm: folio_zero_user: cache neighbouring pages Ankur Arora
2025-10-27 21:33 ` [PATCH v8 0/7] mm: folio_zero_user: clear contiguous pages Andrew Morton
2025-10-28 17:22   ` Ankur Arora
2025-11-07  5:33     ` Ankur Arora
2025-11-07  8:59       ` David Hildenbrand (Red Hat)
  -- strict thread matches above, loose matches on Subject: below --
2025-10-28 17:15 Ankur Arora
