* [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents
@ 2025-07-10  0:59 Ankur Arora
  2025-07-10  0:59 ` [PATCH v5 01/14] perf bench mem: Remove repetition around time measurement Ankur Arora
                   ` (13 more replies)
  0 siblings, 14 replies; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

This series adds clearing of page-extents for hugepages, improving on the
current page-at-a-time approach in two ways:

 - amortizes the per-page setup cost over a larger extent

 - when using string instructions, exposes the real region size to the
   processor, which can use it as a hint to optimize for the full
   extent (see the sketch below). AMD Zen uarchs, for example, elide
   cacheline allocation for regions larger than the L3 size.
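
For illustration, a minimal sketch of the two approaches (a
kernel-context fragment, not the series' exact code; clear_pages()
is introduced in patch 11):

	/*
	 * Page-at-a-time: the processor only ever sees PAGE_SIZE-sized
	 * writes, so it cannot optimize based on the full extent.
	 */
	for (long i = 0; i < folio_nr_pages(folio); i++)
		clear_page(folio_address(folio) + i * PAGE_SIZE);

	/*
	 * Page-extent: a single call exposes the whole region, so a
	 * "REP; STOS" based implementation sees the real length.
	 */
	clear_pages(folio_address(folio), folio_nr_pages(folio));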

Demand faulting a 64GB region shows performance improvements:

 $ perf bench mem map -p $pg-sz -f demand -s 64GB -l 5

                 mm/folio_zero_user    x86/folio_zero_user       change
                  (GB/s  +- %stdev)     (GB/s  +- %stdev)

   pg-sz=2MB       11.82  +- 0.67%        16.48  +-  0.30%       + 39.4%	preempt=*

   pg-sz=1GB       17.14  +- 1.39%        17.42  +-  0.98% [#]   +  1.6%	preempt=none|voluntary
   pg-sz=1GB       17.51  +- 1.19%        43.23  +-  5.22%       +146.8%	preempt=full|lazy

[#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, which is higher than ARCH_CLEAR_PAGE_EXTENT, so
preempt=none|voluntary sees no improvement with pg-sz=1GB.

Raghavendra also tested v3/v4 on AMD Genoa and saw similar improvements [1].

Structure of the series:

Patches 1-5, 8,
  "perf bench mem: Remove repetition around time measurement"
  "perf bench mem: Defer type munging of size to float"
  "perf bench mem: Move mem op parameters into a structure"
  "perf bench mem: Pull out init/fini logic"
  "perf bench mem: Switch from zalloc() to mmap()"
  "perf bench mem: Refactor mem_options"

refactor, and patches 6-7, 9
  "perf bench mem: Allow mapping of hugepages"
  "perf bench mem: Allow chunking on a memory region"
  "perf bench mem: Add mmap() workload"

add a few new perf bench mem workloads (chunking and mapping
performance).

Patches 10-11,
  "x86/mm: Simplify clear_page_*"
  "x86/clear_page: Introduce clear_pages()"

inline the ERMS and REP_GOOD implementations used by clear_page()
and add clear_pages() to handle page extents.

Patches 12-13,
  "mm: add config option for clearing page-extents"
  "mm: memory: support clearing page-extents"

add support for extent zeroing via folio_zero_user().

And finally, patch 14,
  "x86/clear_pages: Support clearing of page-extents"

adds x86 support so folio_zero_user() can take advantage of
clear_pages().

Changelog:

v5:
 - move the non-HIGHMEM implementation of folio_zero_user() from x86
   to common code (Dave Hansen)
 - minor naming cleanups, commit message fixes, etc.

v4:
 - adds perf bench workloads to exercise mmap() populate/demand-fault (Ingo)
 - inline stosb etc (PeterZ)
 - handle cooperative preemption models (Ingo)
 - interface and other cleanups all over (Ingo)
 (https://lore.kernel.org/lkml/20250616052223.723982-1-ankur.a.arora@oracle.com/)

v3:
 - get rid of preemption dependency (TIF_ALLOW_RESCHED); this version
   was limited to preempt=full|lazy.
 - override folio_zero_user() (Linus)
 (https://lore.kernel.org/lkml/20250414034607.762653-1-ankur.a.arora@oracle.com/)

v2:
 - addressed review comments from peterz, tglx.
 - Removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
 - General code cleanup
 (https://lore.kernel.org/lkml/20230830184958.2333078-1-ankur.a.arora@oracle.com/)

Comments appreciated!

Also at:
  github.com/terminus/linux clear-pages.v5

[1] https://lore.kernel.org/lkml/fffd4dad-2cb9-4bc9-8a80-a70be687fd54@amd.com/

Ankur Arora (14):
  perf bench mem: Remove repetition around time measurement
  perf bench mem: Defer type munging of size to float
  perf bench mem: Move mem op parameters into a structure
  perf bench mem: Pull out init/fini logic
  perf bench mem: Switch from zalloc() to mmap()
  perf bench mem: Allow mapping of hugepages
  perf bench mem: Allow chunking on a memory region
  perf bench mem: Refactor mem_options
  perf bench mem: Add mmap() workloads
  x86/mm: Simplify clear_page_*
  x86/clear_page: Introduce clear_pages()
  mm: add config option for clearing page-extents
  mm: memory: support clearing page-extents
  x86/clear_pages: Support clearing of page-extents

 arch/x86/Kconfig                             |   4 +
 arch/x86/include/asm/page_32.h               |  17 +-
 arch/x86/include/asm/page_64.h               |  63 ++-
 arch/x86/lib/clear_page_64.S                 |  39 +-
 mm/Kconfig                                   |   9 +
 mm/memory.c                                  |  86 +++-
 tools/perf/bench/bench.h                     |   1 +
 tools/perf/bench/mem-functions.c             | 391 ++++++++++++++-----
 tools/perf/bench/mem-memcpy-arch.h           |   2 +-
 tools/perf/bench/mem-memcpy-x86-64-asm-def.h |   4 +
 tools/perf/bench/mem-memset-arch.h           |   2 +-
 tools/perf/bench/mem-memset-x86-64-asm-def.h |   4 +
 tools/perf/builtin-bench.c                   |   1 +
 13 files changed, 472 insertions(+), 151 deletions(-)

-- 
2.43.5



* [PATCH v5 01/14] perf bench mem: Remove repetition around time measurement
  2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
@ 2025-07-10  0:59 ` Ankur Arora
  2025-07-15 20:04   ` Namhyung Kim
  2025-07-10  0:59 ` [PATCH v5 02/14] perf bench mem: Defer type munging of size to float Ankur Arora
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

We have two copies of each mem benchmark: one that uses cycles to
measure time, the other gettimeofday().

Unify them.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c | 110 +++++++++++++------------------
 1 file changed, 46 insertions(+), 64 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index 19d45c377ac1..8599ed96ee1f 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -51,6 +51,11 @@ static const struct option options[] = {
 	OPT_END()
 };
 
+union bench_clock {
+	u64		cycles;
+	struct timeval	tv;
+};
+
 typedef void *(*memcpy_t)(void *, const void *, size_t);
 typedef void *(*memset_t)(void *, int, size_t);
 
@@ -91,6 +96,26 @@ static u64 get_cycles(void)
 	return clk;
 }
 
+static void clock_get(union bench_clock *t)
+{
+	if (use_cycles)
+		t->cycles = get_cycles();
+	else
+		BUG_ON(gettimeofday(&t->tv, NULL));
+}
+
+static union bench_clock clock_diff(union bench_clock *s, union bench_clock *e)
+{
+	union bench_clock t;
+
+	if (use_cycles)
+		t.cycles = e->cycles - s->cycles;
+	else
+		timersub(&e->tv, &s->tv, &t.tv);
+
+	return t;
+}
+
 static double timeval2double(struct timeval *ts)
 {
 	return (double)ts->tv_sec + (double)ts->tv_usec / (double)USEC_PER_SEC;
@@ -109,8 +134,7 @@ static double timeval2double(struct timeval *ts)
 
 struct bench_mem_info {
 	const struct function *functions;
-	u64 (*do_cycles)(const struct function *r, size_t size, void *src, void *dst);
-	double (*do_gettimeofday)(const struct function *r, size_t size, void *src, void *dst);
+	union bench_clock (*do_op)(const struct function *r, size_t size, void *src, void *dst);
 	const char *const *usage;
 	bool alloc_src;
 };
@@ -119,7 +143,7 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
 {
 	const struct function *r = &info->functions[r_idx];
 	double result_bps = 0.0;
-	u64 result_cycles = 0;
+	union bench_clock rt = { 0 };
 	void *src = NULL, *dst = zalloc(size);
 
 	printf("# function '%s' (%s)\n", r->name, r->desc);
@@ -136,25 +160,23 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
 	if (bench_format == BENCH_FORMAT_DEFAULT)
 		printf("# Copying %s bytes ...\n\n", size_str);
 
-	if (use_cycles) {
-		result_cycles = info->do_cycles(r, size, src, dst);
-	} else {
-		result_bps = info->do_gettimeofday(r, size, src, dst);
-	}
+	rt = info->do_op(r, size, src, dst);
 
 	switch (bench_format) {
 	case BENCH_FORMAT_DEFAULT:
 		if (use_cycles) {
-			printf(" %14lf cycles/byte\n", (double)result_cycles/size_total);
+			printf(" %14lf cycles/byte\n", (double)rt.cycles/size_total);
 		} else {
+			result_bps = size_total/timeval2double(&rt.tv);
 			print_bps(result_bps);
 		}
 		break;
 
 	case BENCH_FORMAT_SIMPLE:
 		if (use_cycles) {
-			printf("%lf\n", (double)result_cycles/size_total);
+			printf("%lf\n", (double)rt.cycles/size_total);
 		} else {
+			result_bps = size_total/timeval2double(&rt.tv);
 			printf("%lf\n", result_bps);
 		}
 		break;
@@ -235,38 +257,21 @@ static void memcpy_prefault(memcpy_t fn, size_t size, void *src, void *dst)
 	fn(dst, src, size);
 }
 
-static u64 do_memcpy_cycles(const struct function *r, size_t size, void *src, void *dst)
+static union bench_clock do_memcpy(const struct function *r, size_t size,
+				   void *src, void *dst)
 {
-	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	union bench_clock start, end;
 	memcpy_t fn = r->fn.memcpy;
 	int i;
 
 	memcpy_prefault(fn, size, src, dst);
 
-	cycle_start = get_cycles();
+	clock_get(&start);
 	for (i = 0; i < nr_loops; ++i)
 		fn(dst, src, size);
-	cycle_end = get_cycles();
+	clock_get(&end);
 
-	return cycle_end - cycle_start;
-}
-
-static double do_memcpy_gettimeofday(const struct function *r, size_t size, void *src, void *dst)
-{
-	struct timeval tv_start, tv_end, tv_diff;
-	memcpy_t fn = r->fn.memcpy;
-	int i;
-
-	memcpy_prefault(fn, size, src, dst);
-
-	BUG_ON(gettimeofday(&tv_start, NULL));
-	for (i = 0; i < nr_loops; ++i)
-		fn(dst, src, size);
-	BUG_ON(gettimeofday(&tv_end, NULL));
-
-	timersub(&tv_end, &tv_start, &tv_diff);
-
-	return (double)(((double)size * nr_loops) / timeval2double(&tv_diff));
+	return clock_diff(&start, &end);
 }
 
 struct function memcpy_functions[] = {
@@ -292,8 +297,7 @@ int bench_mem_memcpy(int argc, const char **argv)
 {
 	struct bench_mem_info info = {
 		.functions		= memcpy_functions,
-		.do_cycles		= do_memcpy_cycles,
-		.do_gettimeofday	= do_memcpy_gettimeofday,
+		.do_op			= do_memcpy,
 		.usage			= bench_mem_memcpy_usage,
 		.alloc_src              = true,
 	};
@@ -301,9 +305,10 @@ int bench_mem_memcpy(int argc, const char **argv)
 	return bench_mem_common(argc, argv, &info);
 }
 
-static u64 do_memset_cycles(const struct function *r, size_t size, void *src __maybe_unused, void *dst)
+static union bench_clock do_memset(const struct function *r, size_t size,
+				   void *src __maybe_unused, void *dst)
 {
-	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	union bench_clock start, end;
 	memset_t fn = r->fn.memset;
 	int i;
 
@@ -313,34 +318,12 @@ static u64 do_memset_cycles(const struct function *r, size_t size, void *src __m
 	 */
 	fn(dst, -1, size);
 
-	cycle_start = get_cycles();
+	clock_get(&start);
 	for (i = 0; i < nr_loops; ++i)
 		fn(dst, i, size);
-	cycle_end = get_cycles();
+	clock_get(&end);
 
-	return cycle_end - cycle_start;
-}
-
-static double do_memset_gettimeofday(const struct function *r, size_t size, void *src __maybe_unused, void *dst)
-{
-	struct timeval tv_start, tv_end, tv_diff;
-	memset_t fn = r->fn.memset;
-	int i;
-
-	/*
-	 * We prefault the freshly allocated memory range here,
-	 * to not measure page fault overhead:
-	 */
-	fn(dst, -1, size);
-
-	BUG_ON(gettimeofday(&tv_start, NULL));
-	for (i = 0; i < nr_loops; ++i)
-		fn(dst, i, size);
-	BUG_ON(gettimeofday(&tv_end, NULL));
-
-	timersub(&tv_end, &tv_start, &tv_diff);
-
-	return (double)(((double)size * nr_loops) / timeval2double(&tv_diff));
+	return clock_diff(&start, &end);
 }
 
 static const char * const bench_mem_memset_usage[] = {
@@ -366,8 +349,7 @@ int bench_mem_memset(int argc, const char **argv)
 {
 	struct bench_mem_info info = {
 		.functions		= memset_functions,
-		.do_cycles		= do_memset_cycles,
-		.do_gettimeofday	= do_memset_gettimeofday,
+		.do_op			= do_memset,
 		.usage			= bench_mem_memset_usage,
 	};
 
-- 
2.43.5



* [PATCH v5 02/14] perf bench mem: Defer type munging of size to float
  2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
  2025-07-10  0:59 ` [PATCH v5 01/14] perf bench mem: Remove repetition around time measurement Ankur Arora
@ 2025-07-10  0:59 ` Ankur Arora
  2025-07-15 20:05   ` Namhyung Kim
  2025-07-10  0:59 ` [PATCH v5 03/14] perf bench mem: Move mem op parameters into a structure Ankur Arora
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Do type conversion to double at the point of use.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index 8599ed96ee1f..b8f020379197 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -139,7 +139,7 @@ struct bench_mem_info {
 	bool alloc_src;
 };
 
-static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t size, double size_total)
+static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t size, size_t size_total)
 {
 	const struct function *r = &info->functions[r_idx];
 	double result_bps = 0.0;
@@ -165,18 +165,18 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
 	switch (bench_format) {
 	case BENCH_FORMAT_DEFAULT:
 		if (use_cycles) {
-			printf(" %14lf cycles/byte\n", (double)rt.cycles/size_total);
+			printf(" %14lf cycles/byte\n", (double)rt.cycles/(double)size_total);
 		} else {
-			result_bps = size_total/timeval2double(&rt.tv);
+			result_bps = (double)size_total/timeval2double(&rt.tv);
 			print_bps(result_bps);
 		}
 		break;
 
 	case BENCH_FORMAT_SIMPLE:
 		if (use_cycles) {
-			printf("%lf\n", (double)rt.cycles/size_total);
+			printf("%lf\n", (double)rt.cycles/(double)size_total);
 		} else {
-			result_bps = size_total/timeval2double(&rt.tv);
+			result_bps = (double)size_total/timeval2double(&rt.tv);
 			printf("%lf\n", result_bps);
 		}
 		break;
@@ -199,7 +199,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 {
 	int i;
 	size_t size;
-	double size_total;
+	size_t size_total;
 
 	argc = parse_options(argc, argv, options, info->usage, 0);
 
@@ -212,7 +212,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 	}
 
 	size = (size_t)perf_atoll((char *)size_str);
-	size_total = (double)size * nr_loops;
+	size_total = (size_t)size * nr_loops;
 
 	if ((s64)size <= 0) {
 		fprintf(stderr, "Invalid size:%s\n", size_str);
-- 
2.43.5



* [PATCH v5 03/14] perf bench mem: Move mem op parameters into a structure
  2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
  2025-07-10  0:59 ` [PATCH v5 01/14] perf bench mem: Remove repetition around time measurement Ankur Arora
  2025-07-10  0:59 ` [PATCH v5 02/14] perf bench mem: Defer type munging of size to float Ankur Arora
@ 2025-07-10  0:59 ` Ankur Arora
  2025-07-15 20:06   ` Namhyung Kim
  2025-07-10  0:59 ` [PATCH v5 04/14] perf bench mem: Pull out init/fini logic Ankur Arora
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Move benchmark function parameters into struct bench_params.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c | 63 +++++++++++++++++---------------
 1 file changed, 34 insertions(+), 29 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index b8f020379197..fb17d36a6f6c 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -30,7 +30,7 @@
 
 static const char	*size_str	= "1MB";
 static const char	*function_str	= "all";
-static int		nr_loops	= 1;
+static unsigned int	nr_loops	= 1;
 static bool		use_cycles;
 static int		cycles_fd;
 
@@ -42,7 +42,7 @@ static const struct option options[] = {
 	OPT_STRING('f', "function", &function_str, "all",
 		    "Specify the function to run, \"all\" runs all available functions, \"help\" lists them"),
 
-	OPT_INTEGER('l', "nr_loops", &nr_loops,
+	OPT_UINTEGER('l', "nr_loops", &nr_loops,
 		    "Specify the number of loops to run. (default: 1)"),
 
 	OPT_BOOLEAN('c', "cycles", &use_cycles,
@@ -56,6 +56,12 @@ union bench_clock {
 	struct timeval	tv;
 };
 
+struct bench_params {
+	size_t		size;
+	size_t		size_total;
+	unsigned int	nr_loops;
+};
+
 typedef void *(*memcpy_t)(void *, const void *, size_t);
 typedef void *(*memset_t)(void *, int, size_t);
 
@@ -134,17 +140,19 @@ static double timeval2double(struct timeval *ts)
 
 struct bench_mem_info {
 	const struct function *functions;
-	union bench_clock (*do_op)(const struct function *r, size_t size, void *src, void *dst);
+	union bench_clock (*do_op)(const struct function *r, struct bench_params *p,
+				   void *src, void *dst);
 	const char *const *usage;
 	bool alloc_src;
 };
 
-static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t size, size_t size_total)
+static void __bench_mem_function(struct bench_mem_info *info, struct bench_params *p,
+				 int r_idx)
 {
 	const struct function *r = &info->functions[r_idx];
 	double result_bps = 0.0;
 	union bench_clock rt = { 0 };
-	void *src = NULL, *dst = zalloc(size);
+	void *src = NULL, *dst = zalloc(p->size);
 
 	printf("# function '%s' (%s)\n", r->name, r->desc);
 
@@ -152,7 +160,7 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
 		goto out_alloc_failed;
 
 	if (info->alloc_src) {
-		src = zalloc(size);
+		src = zalloc(p->size);
 		if (src == NULL)
 			goto out_alloc_failed;
 	}
@@ -160,23 +168,23 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
 	if (bench_format == BENCH_FORMAT_DEFAULT)
 		printf("# Copying %s bytes ...\n\n", size_str);
 
-	rt = info->do_op(r, size, src, dst);
+	rt = info->do_op(r, p, src, dst);
 
 	switch (bench_format) {
 	case BENCH_FORMAT_DEFAULT:
 		if (use_cycles) {
-			printf(" %14lf cycles/byte\n", (double)rt.cycles/(double)size_total);
+			printf(" %14lf cycles/byte\n", (double)rt.cycles/(double)p->size_total);
 		} else {
-			result_bps = (double)size_total/timeval2double(&rt.tv);
+			result_bps = (double)p->size_total/timeval2double(&rt.tv);
 			print_bps(result_bps);
 		}
 		break;
 
 	case BENCH_FORMAT_SIMPLE:
 		if (use_cycles) {
-			printf("%lf\n", (double)rt.cycles/(double)size_total);
+			printf("%lf\n", (double)rt.cycles/(double)p->size_total);
 		} else {
-			result_bps = (double)size_total/timeval2double(&rt.tv);
+			result_bps = (double)p->size_total/timeval2double(&rt.tv);
 			printf("%lf\n", result_bps);
 		}
 		break;
@@ -198,8 +206,7 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
 static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *info)
 {
 	int i;
-	size_t size;
-	size_t size_total;
+	struct bench_params p = { 0 };
 
 	argc = parse_options(argc, argv, options, info->usage, 0);
 
@@ -211,17 +218,17 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 		}
 	}
 
-	size = (size_t)perf_atoll((char *)size_str);
-	size_total = (size_t)size * nr_loops;
-
-	if ((s64)size <= 0) {
+	p.nr_loops = nr_loops;
+	p.size = (size_t)perf_atoll((char *)size_str);
+	if ((s64)p.size <= 0) {
 		fprintf(stderr, "Invalid size:%s\n", size_str);
 		return 1;
 	}
+	p.size_total = (size_t)p.size * p.nr_loops;
 
 	if (!strncmp(function_str, "all", 3)) {
 		for (i = 0; info->functions[i].name; i++)
-			__bench_mem_function(info, i, size, size_total);
+			__bench_mem_function(info, &p, i);
 		return 0;
 	}
 
@@ -240,7 +247,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 		return 1;
 	}
 
-	__bench_mem_function(info, i, size, size_total);
+	__bench_mem_function(info, &p, i);
 
 	return 0;
 }
@@ -257,18 +264,17 @@ static void memcpy_prefault(memcpy_t fn, size_t size, void *src, void *dst)
 	fn(dst, src, size);
 }
 
-static union bench_clock do_memcpy(const struct function *r, size_t size,
+static union bench_clock do_memcpy(const struct function *r, struct bench_params *p,
 				   void *src, void *dst)
 {
 	union bench_clock start, end;
 	memcpy_t fn = r->fn.memcpy;
-	int i;
 
-	memcpy_prefault(fn, size, src, dst);
+	memcpy_prefault(fn, p->size, src, dst);
 
 	clock_get(&start);
-	for (i = 0; i < nr_loops; ++i)
-		fn(dst, src, size);
+	for (unsigned int i = 0; i < p->nr_loops; ++i)
+		fn(dst, src, p->size);
 	clock_get(&end);
 
 	return clock_diff(&start, &end);
@@ -305,22 +311,21 @@ int bench_mem_memcpy(int argc, const char **argv)
 	return bench_mem_common(argc, argv, &info);
 }
 
-static union bench_clock do_memset(const struct function *r, size_t size,
+static union bench_clock do_memset(const struct function *r, struct bench_params *p,
 				   void *src __maybe_unused, void *dst)
 {
 	union bench_clock start, end;
 	memset_t fn = r->fn.memset;
-	int i;
 
 	/*
 	 * We prefault the freshly allocated memory range here,
 	 * to not measure page fault overhead:
 	 */
-	fn(dst, -1, size);
+	fn(dst, -1, p->size);
 
 	clock_get(&start);
-	for (i = 0; i < nr_loops; ++i)
-		fn(dst, i, size);
+	for (unsigned int i = 0; i < p->nr_loops; ++i)
+		fn(dst, i, p->size);
 	clock_get(&end);
 
 	return clock_diff(&start, &end);
-- 
2.43.5



* [PATCH v5 04/14] perf bench mem: Pull out init/fini logic
  2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
                   ` (2 preceding siblings ...)
  2025-07-10  0:59 ` [PATCH v5 03/14] perf bench mem: Move mem op parameters into a structure Ankur Arora
@ 2025-07-10  0:59 ` Ankur Arora
  2025-07-15 20:09   ` Namhyung Kim
  2025-07-10  0:59 ` [PATCH v5 05/14] perf bench mem: Switch from zalloc() to mmap() Ankur Arora
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Pull the memory allocation/free logic out into per-function init/fini
handlers. No functional change.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c             | 103 +++++++++++++------
 tools/perf/bench/mem-memcpy-arch.h           |   2 +-
 tools/perf/bench/mem-memcpy-x86-64-asm-def.h |   4 +
 tools/perf/bench/mem-memset-arch.h           |   2 +-
 tools/perf/bench/mem-memset-x86-64-asm-def.h |   4 +
 5 files changed, 81 insertions(+), 34 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index fb17d36a6f6c..06d3ee6f5d69 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -62,15 +62,31 @@ struct bench_params {
 	unsigned int	nr_loops;
 };
 
+struct bench_mem_info {
+	const struct function *functions;
+	int (*do_op)(const struct function *r, struct bench_params *p,
+		     void *src, void *dst, union bench_clock *rt);
+	const char *const *usage;
+	bool alloc_src;
+};
+
+typedef bool (*mem_init_t)(struct bench_mem_info *, struct bench_params *,
+			   void **, void **);
+typedef void (*mem_fini_t)(struct bench_mem_info *, struct bench_params *,
+			   void **, void **);
 typedef void *(*memcpy_t)(void *, const void *, size_t);
 typedef void *(*memset_t)(void *, int, size_t);
 
 struct function {
 	const char *name;
 	const char *desc;
-	union {
-		memcpy_t memcpy;
-		memset_t memset;
+	struct {
+		mem_init_t init;
+		mem_fini_t fini;
+		union {
+			memcpy_t memcpy;
+			memset_t memset;
+		};
 	} fn;
 };
 
@@ -138,37 +154,24 @@ static double timeval2double(struct timeval *ts)
 			printf(" %14lf GB/sec\n", x / K / K / K);	\
 	} while (0)
 
-struct bench_mem_info {
-	const struct function *functions;
-	union bench_clock (*do_op)(const struct function *r, struct bench_params *p,
-				   void *src, void *dst);
-	const char *const *usage;
-	bool alloc_src;
-};
-
 static void __bench_mem_function(struct bench_mem_info *info, struct bench_params *p,
 				 int r_idx)
 {
 	const struct function *r = &info->functions[r_idx];
 	double result_bps = 0.0;
 	union bench_clock rt = { 0 };
-	void *src = NULL, *dst = zalloc(p->size);
+	void *src = NULL, *dst = NULL;
 
 	printf("# function '%s' (%s)\n", r->name, r->desc);
 
-	if (dst == NULL)
-		goto out_alloc_failed;
-
-	if (info->alloc_src) {
-		src = zalloc(p->size);
-		if (src == NULL)
-			goto out_alloc_failed;
-	}
+	if (r->fn.init && r->fn.init(info, p, &src, &dst))
+		goto out_init_failed;
 
 	if (bench_format == BENCH_FORMAT_DEFAULT)
 		printf("# Copying %s bytes ...\n\n", size_str);
 
-	rt = info->do_op(r, p, src, dst);
+	if (info->do_op(r, p, src, dst, &rt))
+		goto out_test_failed;
 
 	switch (bench_format) {
 	case BENCH_FORMAT_DEFAULT:
@@ -194,11 +197,11 @@ static void __bench_mem_function(struct bench_mem_info *info, struct bench_param
 		break;
 	}
 
+out_test_failed:
 out_free:
-	free(src);
-	free(dst);
+	if (r->fn.fini) r->fn.fini(info, p, &src, &dst);
 	return;
-out_alloc_failed:
+out_init_failed:
 	printf("# Memory allocation failed - maybe size (%s) is too large?\n", size_str);
 	goto out_free;
 }
@@ -264,8 +267,8 @@ static void memcpy_prefault(memcpy_t fn, size_t size, void *src, void *dst)
 	fn(dst, src, size);
 }
 
-static union bench_clock do_memcpy(const struct function *r, struct bench_params *p,
-				   void *src, void *dst)
+static int do_memcpy(const struct function *r, struct bench_params *p,
+		     void *src, void *dst, union bench_clock *rt)
 {
 	union bench_clock start, end;
 	memcpy_t fn = r->fn.memcpy;
@@ -277,16 +280,47 @@ static union bench_clock do_memcpy(const struct function *r, struct bench_params
 		fn(dst, src, p->size);
 	clock_get(&end);
 
-	return clock_diff(&start, &end);
+	*rt = clock_diff(&start, &end);
+
+	return 0;
+}
+
+static bool mem_alloc(struct bench_mem_info *info, struct bench_params *p,
+		      void **src, void **dst)
+{
+	bool failed;
+
+	*dst = zalloc(p->size);
+	failed = *dst == NULL;
+
+	if (info->alloc_src) {
+		*src = zalloc(p->size);
+		failed = failed || *src == NULL;
+	}
+
+	return failed;
+}
+
+static void mem_free(struct bench_mem_info *info __maybe_unused,
+		     struct bench_params *p __maybe_unused,
+		     void **src, void **dst)
+{
+	free(*dst);
+	free(*src);
+
+	*dst = *src = NULL;
 }
 
 struct function memcpy_functions[] = {
 	{ .name		= "default",
 	  .desc		= "Default memcpy() provided by glibc",
+	  .fn.init	= mem_alloc,
+	  .fn.fini	= mem_free,
 	  .fn.memcpy	= memcpy },
 
 #ifdef HAVE_ARCH_X86_64_SUPPORT
-# define MEMCPY_FN(_fn, _name, _desc) {.name = _name, .desc = _desc, .fn.memcpy = _fn},
+# define MEMCPY_FN(_fn, _init, _fini, _name, _desc)	\
+	{.name = _name, .desc = _desc, .fn.memcpy = _fn, .fn.init = _init, .fn.fini = _fini },
 # include "mem-memcpy-x86-64-asm-def.h"
 # undef MEMCPY_FN
 #endif
@@ -311,8 +345,8 @@ int bench_mem_memcpy(int argc, const char **argv)
 	return bench_mem_common(argc, argv, &info);
 }
 
-static union bench_clock do_memset(const struct function *r, struct bench_params *p,
-				   void *src __maybe_unused, void *dst)
+static int do_memset(const struct function *r, struct bench_params *p,
+		     void *src __maybe_unused, void *dst, union bench_clock *rt)
 {
 	union bench_clock start, end;
 	memset_t fn = r->fn.memset;
@@ -328,7 +362,9 @@ static union bench_clock do_memset(const struct function *r, struct bench_params
 		fn(dst, i, p->size);
 	clock_get(&end);
 
-	return clock_diff(&start, &end);
+	*rt = clock_diff(&start, &end);
+
+	return 0;
 }
 
 static const char * const bench_mem_memset_usage[] = {
@@ -339,10 +375,13 @@ static const char * const bench_mem_memset_usage[] = {
 static const struct function memset_functions[] = {
 	{ .name		= "default",
 	  .desc		= "Default memset() provided by glibc",
+	  .fn.init	= mem_alloc,
+	  .fn.fini	= mem_free,
 	  .fn.memset	= memset },
 
 #ifdef HAVE_ARCH_X86_64_SUPPORT
-# define MEMSET_FN(_fn, _name, _desc) { .name = _name, .desc = _desc, .fn.memset = _fn },
+# define MEMSET_FN(_fn, _init, _fini, _name, _desc) \
+	{.name = _name, .desc = _desc, .fn.memset = _fn, .fn.init = _init, .fn.fini = _fini },
 # include "mem-memset-x86-64-asm-def.h"
 # undef MEMSET_FN
 #endif
diff --git a/tools/perf/bench/mem-memcpy-arch.h b/tools/perf/bench/mem-memcpy-arch.h
index 5bcaec5601a8..852e48cfd8fe 100644
--- a/tools/perf/bench/mem-memcpy-arch.h
+++ b/tools/perf/bench/mem-memcpy-arch.h
@@ -2,7 +2,7 @@
 
 #ifdef HAVE_ARCH_X86_64_SUPPORT
 
-#define MEMCPY_FN(fn, name, desc)		\
+#define MEMCPY_FN(fn, init, fini, name, desc)		\
 	void *fn(void *, const void *, size_t);
 
 #include "mem-memcpy-x86-64-asm-def.h"
diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm-def.h b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
index 6188e19d3129..f43038f4448b 100644
--- a/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
+++ b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
@@ -1,9 +1,13 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 MEMCPY_FN(memcpy_orig,
+	mem_alloc,
+	mem_free,
 	"x86-64-unrolled",
 	"unrolled memcpy() in arch/x86/lib/memcpy_64.S")
 
 MEMCPY_FN(__memcpy,
+	mem_alloc,
+	mem_free,
 	"x86-64-movsq",
 	"movsq-based memcpy() in arch/x86/lib/memcpy_64.S")
diff --git a/tools/perf/bench/mem-memset-arch.h b/tools/perf/bench/mem-memset-arch.h
index 53f45482663f..278c5da12d63 100644
--- a/tools/perf/bench/mem-memset-arch.h
+++ b/tools/perf/bench/mem-memset-arch.h
@@ -2,7 +2,7 @@
 
 #ifdef HAVE_ARCH_X86_64_SUPPORT
 
-#define MEMSET_FN(fn, name, desc)		\
+#define MEMSET_FN(fn, init, fini, name, desc)	\
 	void *fn(void *, int, size_t);
 
 #include "mem-memset-x86-64-asm-def.h"
diff --git a/tools/perf/bench/mem-memset-x86-64-asm-def.h b/tools/perf/bench/mem-memset-x86-64-asm-def.h
index 247c72fdfb9d..80ad1b7ea770 100644
--- a/tools/perf/bench/mem-memset-x86-64-asm-def.h
+++ b/tools/perf/bench/mem-memset-x86-64-asm-def.h
@@ -1,9 +1,13 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 MEMSET_FN(memset_orig,
+	mem_alloc,
+	mem_free,
 	"x86-64-unrolled",
 	"unrolled memset() in arch/x86/lib/memset_64.S")
 
 MEMSET_FN(__memset,
+	mem_alloc,
+	mem_free,
 	"x86-64-stosq",
 	"movsq-based memset() in arch/x86/lib/memset_64.S")
-- 
2.43.5



* [PATCH v5 05/14] perf bench mem: Switch from zalloc() to mmap()
  2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
                   ` (3 preceding siblings ...)
  2025-07-10  0:59 ` [PATCH v5 04/14] perf bench mem: Pull out init/fini logic Ankur Arora
@ 2025-07-10  0:59 ` Ankur Arora
  2025-07-15 20:09   ` Namhyung Kim
  2025-07-10  0:59 ` [PATCH v5 06/14] perf bench mem: Allow mapping of hugepages Ankur Arora
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Using mmap() ensures that the buffer is always aligned at a fixed
boundary. Switch to that to remove one source of variability.

Since we always want to read/write from the allocated buffers, map
with pagetables pre-populated.
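
As a minimal illustration of both points (an assumed sketch, not part
of this patch):

	#include <assert.h>
	#include <stddef.h>
	#include <stdint.h>
	#include <sys/mman.h>

	static void *alloc_populated(size_t size)
	{
		void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
				 -1, 0);

		if (buf == MAP_FAILED)
			return NULL;

		/* Anonymous mappings are at least page aligned and, with
		 * MAP_POPULATE, the pagetables come pre-populated. */
		assert(((uintptr_t)buf & 4095) == 0);
		return buf;
	}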

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index 06d3ee6f5d69..914f9048d982 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -22,9 +22,9 @@
 #include <string.h>
 #include <unistd.h>
 #include <sys/time.h>
+#include <sys/mman.h>
 #include <errno.h>
 #include <linux/time64.h>
-#include <linux/zalloc.h>
 
 #define K 1024
 
@@ -285,16 +285,33 @@ static int do_memcpy(const struct function *r, struct bench_params *p,
 	return 0;
 }
 
+static void *bench_mmap(size_t size, bool populate)
+{
+	void *p;
+	int extra = populate ? MAP_POPULATE : 0;
+
+	p = mmap(NULL, size, PROT_READ|PROT_WRITE,
+		 extra | MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+
+	return p == MAP_FAILED ? NULL : p;
+}
+
+static void bench_munmap(void *p, size_t size)
+{
+	if (p)
+		munmap(p, size);
+}
+
 static bool mem_alloc(struct bench_mem_info *info, struct bench_params *p,
 		      void **src, void **dst)
 {
 	bool failed;
 
-	*dst = zalloc(p->size);
+	*dst = bench_mmap(p->size, true);
 	failed = *dst == NULL;
 
 	if (info->alloc_src) {
-		*src = zalloc(p->size);
+		*src = bench_mmap(p->size, true);
 		failed = failed || *src == NULL;
 	}
 
@@ -305,8 +322,8 @@ static void mem_free(struct bench_mem_info *info __maybe_unused,
 		     struct bench_params *p __maybe_unused,
 		     void **src, void **dst)
 {
-	free(*dst);
-	free(*src);
+	bench_munmap(*dst, p->size);
+	bench_munmap(*src, p->size);
 
 	*dst = *src = NULL;
 }
-- 
2.43.5



* [PATCH v5 06/14] perf bench mem: Allow mapping of hugepages
  2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
                   ` (4 preceding siblings ...)
  2025-07-10  0:59 ` [PATCH v5 05/14] perf bench mem: Switch from zalloc() to mmap() Ankur Arora
@ 2025-07-10  0:59 ` Ankur Arora
  2025-07-15 20:12   ` Namhyung Kim
  2025-07-10  0:59 ` [PATCH v5 07/14] perf bench mem: Allow chunking on a memory region Ankur Arora
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Page sizes that can be selected: 4KB, 2MB, 1GB.

Both the reservation of hugepages and the node they are allocated
from are expected to be handled by the user.
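
For instance, a sketch of such a mapping (illustrative only; the sysfs
path is the standard hugetlb interface, and (21 << MAP_HUGE_SHIFT)
matches the encoding this patch uses for a 2MB page size):

	#define _GNU_SOURCE
	#include <stddef.h>
	#include <sys/mman.h>
	#include <linux/mman.h>	/* MAP_HUGE_SHIFT */

	/*
	 * Map a 2MB-hugepage backed buffer. Hugepages must be reserved
	 * beforehand, e.g.:
	 *  echo 64 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
	 */
	static void *map_huge_2mb(size_t size)
	{
		return mmap(NULL, size, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE |
			    MAP_HUGETLB | (21 << MAP_HUGE_SHIFT), -1, 0);
	}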

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c | 33 ++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index 914f9048d982..e4d713587d45 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -25,11 +25,17 @@
 #include <sys/mman.h>
 #include <errno.h>
 #include <linux/time64.h>
+#include <linux/log2.h>
 
 #define K 1024
 
+#define PAGE_SHIFT_4KB		12
+#define PAGE_SHIFT_2MB		21
+#define PAGE_SHIFT_1GB		30
+
 static const char	*size_str	= "1MB";
 static const char	*function_str	= "all";
+static const char	*page_size_str	= "4KB";
 static unsigned int	nr_loops	= 1;
 static bool		use_cycles;
 static int		cycles_fd;
@@ -39,6 +45,10 @@ static const struct option options[] = {
 		    "Specify the size of the memory buffers. "
 		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
 
+	OPT_STRING('p', "page", &page_size_str, "4KB",
+		    "Specify page-size for mapping memory buffers. "
+		    "Available sizes: 4KB, 2MB, 1GB (case insensitive)"),
+
 	OPT_STRING('f', "function", &function_str, "all",
 		    "Specify the function to run, \"all\" runs all available functions, \"help\" lists them"),
 
@@ -60,6 +70,7 @@ struct bench_params {
 	size_t		size;
 	size_t		size_total;
 	unsigned int	nr_loops;
+	unsigned int	page_shift;
 };
 
 struct bench_mem_info {
@@ -202,7 +213,8 @@ static void __bench_mem_function(struct bench_mem_info *info, struct bench_param
 	if (r->fn.fini) r->fn.fini(info, p, &src, &dst);
 	return;
 out_init_failed:
-	printf("# Memory allocation failed - maybe size (%s) is too large?\n", size_str);
+	printf("# Memory allocation failed - maybe size (%s) %s?\n", size_str,
+			p->page_shift != PAGE_SHIFT_4KB ? "has insufficient hugepages" : "is too large");
 	goto out_free;
 }
 
@@ -210,6 +222,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 {
 	int i;
 	struct bench_params p = { 0 };
+	unsigned int page_size;
 
 	argc = parse_options(argc, argv, options, info->usage, 0);
 
@@ -229,6 +242,15 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 	}
 	p.size_total = (size_t)p.size * p.nr_loops;
 
+	page_size = (unsigned int)perf_atoll((char *)page_size_str);
+	if (page_size != (1 << PAGE_SHIFT_4KB) &&
+	    page_size != (1 << PAGE_SHIFT_2MB) &&
+	    page_size != (1 << PAGE_SHIFT_1GB)) {
+		fprintf(stderr, "Invalid page-size:%s\n", page_size_str);
+		return 1;
+	}
+	p.page_shift = ilog2(page_size);
+
 	if (!strncmp(function_str, "all", 3)) {
 		for (i = 0; info->functions[i].name; i++)
 			__bench_mem_function(info, &p, i);
@@ -285,11 +307,14 @@ static int do_memcpy(const struct function *r, struct bench_params *p,
 	return 0;
 }
 
-static void *bench_mmap(size_t size, bool populate)
+static void *bench_mmap(size_t size, bool populate, unsigned int page_shift)
 {
 	void *p;
 	int extra = populate ? MAP_POPULATE : 0;
 
+	if (page_shift != PAGE_SHIFT_4KB)
+		extra |= MAP_HUGETLB | (page_shift << MAP_HUGE_SHIFT);
+
 	p = mmap(NULL, size, PROT_READ|PROT_WRITE,
 		 extra | MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
 
@@ -307,11 +332,11 @@ static bool mem_alloc(struct bench_mem_info *info, struct bench_params *p,
 {
 	bool failed;
 
-	*dst = bench_mmap(p->size, true);
+	*dst = bench_mmap(p->size, true, p->page_shift);
 	failed = *dst == NULL;
 
 	if (info->alloc_src) {
-		*src = bench_mmap(p->size, true);
+		*src = bench_mmap(p->size, true, p->page_shift);
 		failed = failed || *src == NULL;
 	}
 
-- 
2.43.5



* [PATCH v5 07/14] perf bench mem: Allow chunking on a memory region
  2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
                   ` (5 preceding siblings ...)
  2025-07-10  0:59 ` [PATCH v5 06/14] perf bench mem: Allow mapping of hugepages Ankur Arora
@ 2025-07-10  0:59 ` Ankur Arora
  2025-07-15 20:17   ` Namhyung Kim
  2025-07-10  0:59 ` [PATCH v5 08/14] perf bench mem: Refactor mem_options Ankur Arora
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

There can be a significant gap in memset/memcpy performance depending
on the size of the region being operated on.

With chunk-size=4kb:

  $ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

  $ perf bench mem memset -p 4kb -k 4kb -s 4gb -l 10 -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4gb bytes ...

      13.011655 GB/sec

With chunk-size=1gb:

  $ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

  $ perf bench mem memset -p 4kb -k 1gb -s 4gb -l 10 -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4gb bytes ...

      21.936355 GB/sec

So, allow the user to specify the chunk-size.

The default value is identical to the total size of the region, which
preserves current behaviour.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index e4d713587d45..412d18f2cb2e 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -36,6 +36,7 @@
 static const char	*size_str	= "1MB";
 static const char	*function_str	= "all";
 static const char	*page_size_str	= "4KB";
+static const char	*chunk_size_str	= "0";
 static unsigned int	nr_loops	= 1;
 static bool		use_cycles;
 static int		cycles_fd;
@@ -49,6 +50,10 @@ static const struct option options[] = {
 		    "Specify page-size for mapping memory buffers. "
 		    "Available sizes: 4KB, 2MB, 1GB (case insensitive)"),
 
+	OPT_STRING('k', "chunk", &chunk_size_str, "0",
+		    "Specify the chunk-size for each invocation. "
+		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
+
 	OPT_STRING('f', "function", &function_str, "all",
 		    "Specify the function to run, \"all\" runs all available functions, \"help\" lists them"),
 
@@ -69,6 +74,7 @@ union bench_clock {
 struct bench_params {
 	size_t		size;
 	size_t		size_total;
+	size_t		chunk_size;
 	unsigned int	nr_loops;
 	unsigned int	page_shift;
 };
@@ -242,6 +248,14 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 	}
 	p.size_total = (size_t)p.size * p.nr_loops;
 
+	p.chunk_size = (size_t)perf_atoll((char *)chunk_size_str);
+	if ((s64)p.chunk_size < 0 || (s64)p.chunk_size > (s64)p.size) {
+		fprintf(stderr, "Invalid chunk_size:%s\n", chunk_size_str);
+		return 1;
+	}
+	if (!p.chunk_size)
+		p.chunk_size = p.size;
+
 	page_size = (unsigned int)perf_atoll((char *)page_size_str);
 	if (page_size != (1 << PAGE_SHIFT_4KB) &&
 	    page_size != (1 << PAGE_SHIFT_2MB) &&
@@ -299,7 +313,8 @@ static int do_memcpy(const struct function *r, struct bench_params *p,
 
 	clock_get(&start);
 	for (unsigned int i = 0; i < p->nr_loops; ++i)
-		fn(dst, src, p->size);
+		for (size_t off = 0; off < p->size; off += p->chunk_size)
+			fn(dst + off, src + off, min(p->chunk_size, p->size - off));
 	clock_get(&end);
 
 	*rt = clock_diff(&start, &end);
@@ -401,7 +416,8 @@ static int do_memset(const struct function *r, struct bench_params *p,
 
 	clock_get(&start);
 	for (unsigned int i = 0; i < p->nr_loops; ++i)
-		fn(dst, i, p->size);
+		for (size_t off = 0; off < p->size; off += p->chunk_size)
+			fn(dst + off, i, min(p->chunk_size, p->size - off));
 	clock_get(&end);
 
 	*rt = clock_diff(&start, &end);
-- 
2.43.5



* [PATCH v5 08/14] perf bench mem: Refactor mem_options
  2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
                   ` (6 preceding siblings ...)
  2025-07-10  0:59 ` [PATCH v5 07/14] perf bench mem: Allow chunking on a memory region Ankur Arora
@ 2025-07-10  0:59 ` Ankur Arora
  2025-07-15 20:18   ` Namhyung Kim
  2025-07-10  0:59 ` [PATCH v5 09/14] perf bench mem: Add mmap() workloads Ankur Arora
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Split the mem benchmark options into common options and ones specific
to memset/memcpy.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index 412d18f2cb2e..8a37da149327 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -41,7 +41,7 @@ static unsigned int	nr_loops	= 1;
 static bool		use_cycles;
 static int		cycles_fd;
 
-static const struct option options[] = {
+static const struct option bench_common_options[] = {
 	OPT_STRING('s', "size", &size_str, "1MB",
 		    "Specify the size of the memory buffers. "
 		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
@@ -50,10 +50,6 @@ static const struct option options[] = {
 		    "Specify page-size for mapping memory buffers. "
 		    "Available sizes: 4KB, 2MB, 1GB (case insensitive)"),
 
-	OPT_STRING('k', "chunk", &chunk_size_str, "0",
-		    "Specify the chunk-size for each invocation. "
-		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
-
 	OPT_STRING('f', "function", &function_str, "all",
 		    "Specify the function to run, \"all\" runs all available functions, \"help\" lists them"),
 
@@ -66,6 +62,14 @@ static const struct option options[] = {
 	OPT_END()
 };
 
+static const struct option bench_mem_options[] = {
+	OPT_STRING('k', "chunk", &chunk_size_str, "0",
+		    "Specify the chunk-size for each invocation. "
+		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
+	OPT_PARENT(bench_common_options),
+	OPT_END()
+};
+
 union bench_clock {
 	u64		cycles;
 	struct timeval	tv;
@@ -84,6 +88,7 @@ struct bench_mem_info {
 	int (*do_op)(const struct function *r, struct bench_params *p,
 		     void *src, void *dst, union bench_clock *rt);
 	const char *const *usage;
+	const struct option *options;
 	bool alloc_src;
 };
 
@@ -230,7 +235,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 	struct bench_params p = { 0 };
 	unsigned int page_size;
 
-	argc = parse_options(argc, argv, options, info->usage, 0);
+	argc = parse_options(argc, argv, info->options, info->usage, 0);
 
 	if (use_cycles) {
 		i = init_cycles();
@@ -396,6 +401,7 @@ int bench_mem_memcpy(int argc, const char **argv)
 		.functions		= memcpy_functions,
 		.do_op			= do_memcpy,
 		.usage			= bench_mem_memcpy_usage,
+		.options		= bench_mem_options,
 		.alloc_src              = true,
 	};
 
@@ -453,6 +459,7 @@ int bench_mem_memset(int argc, const char **argv)
 		.functions		= memset_functions,
 		.do_op			= do_memset,
 		.usage			= bench_mem_memset_usage,
+		.options		= bench_mem_options,
 	};
 
 	return bench_mem_common(argc, argv, &info);
-- 
2.43.5



* [PATCH v5 09/14] perf bench mem: Add mmap() workloads
  2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
                   ` (7 preceding siblings ...)
  2025-07-10  0:59 ` [PATCH v5 08/14] perf bench mem: Refactor mem_options Ankur Arora
@ 2025-07-10  0:59 ` Ankur Arora
  2025-07-15 20:20   ` Namhyung Kim
  2025-07-10  0:59 ` [PATCH v5 10/14] x86/mm: Simplify clear_page_* Ankur Arora
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Add two mmap() workloads: one that eagerly populates a region and
another that demand-faults it in.

The intent is to probe the memory subsystem overhead incurred by
mmap().

  $ perf bench mem map -s 4gb -p 4kb -l 10 -f populate
  # Running 'mem/map' benchmark:
  # function 'populate' (Eagerly populated map)
  # Copying 4gb bytes ...

       1.811691 GB/sec

  $ perf bench mem map -s 4gb -p 2mb -l 10 -f populate
  # Running 'mem/map' benchmark:
  # function 'populate' (Eagerly populated map)
  # Copying 4gb bytes ...

      12.272017 GB/sec

  $ perf bench mem map -s 4gb -p 1gb -l 10 -f populate
  # Running 'mem/map' benchmark:
  # function 'populate' (Eagerly populated map)
  # Copying 4gb bytes ...

      17.085927 GB/sec
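
The demand variant exercises the fault-in path instead, e.g. (output
elided):

  $ perf bench mem map -s 4gb -p 2mb -l 10 -f demand

Passing '-r <seed>' additionally randomizes the read/write offset
within each touched page.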

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/bench.h         |  1 +
 tools/perf/bench/mem-functions.c | 96 ++++++++++++++++++++++++++++++++
 tools/perf/builtin-bench.c       |  1 +
 3 files changed, 98 insertions(+)

diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index 9f736423af53..46484bb0eefb 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -28,6 +28,7 @@ int bench_syscall_fork(int argc, const char **argv);
 int bench_syscall_execve(int argc, const char **argv);
 int bench_mem_memcpy(int argc, const char **argv);
 int bench_mem_memset(int argc, const char **argv);
+int bench_mem_map(int argc, const char **argv);
 int bench_mem_find_bit(int argc, const char **argv);
 int bench_futex_hash(int argc, const char **argv);
 int bench_futex_wake(int argc, const char **argv);
diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index 8a37da149327..ea62e3583a70 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -40,6 +40,7 @@ static const char	*chunk_size_str	= "0";
 static unsigned int	nr_loops	= 1;
 static bool		use_cycles;
 static int		cycles_fd;
+static unsigned int	seed;
 
 static const struct option bench_common_options[] = {
 	OPT_STRING('s', "size", &size_str, "1MB",
@@ -81,6 +82,7 @@ struct bench_params {
 	size_t		chunk_size;
 	unsigned int	nr_loops;
 	unsigned int	page_shift;
+	unsigned int	seed;
 };
 
 struct bench_mem_info {
@@ -98,6 +100,7 @@ typedef void (*mem_fini_t)(struct bench_mem_info *, struct bench_params *,
 			   void **, void **);
 typedef void *(*memcpy_t)(void *, const void *, size_t);
 typedef void *(*memset_t)(void *, int, size_t);
+typedef void (*map_op_t)(void *, size_t, unsigned int, bool);
 
 struct function {
 	const char *name;
@@ -108,6 +111,7 @@ struct function {
 		union {
 			memcpy_t memcpy;
 			memset_t memset;
+			map_op_t map_op;
 		};
 	} fn;
 };
@@ -160,6 +164,14 @@ static union bench_clock clock_diff(union bench_clock *s, union bench_clock *e)
 	return t;
 }
 
+static void clock_accum(union bench_clock *a, union bench_clock *b)
+{
+	if (use_cycles)
+		a->cycles += b->cycles;
+	else
+		timeradd(&a->tv, &b->tv, &a->tv);
+}
+
 static double timeval2double(struct timeval *ts)
 {
 	return (double)ts->tv_sec + (double)ts->tv_usec / (double)USEC_PER_SEC;
@@ -270,6 +282,8 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 	}
 	p.page_shift = ilog2(page_size);
 
+	p.seed = seed;
+
 	if (!strncmp(function_str, "all", 3)) {
 		for (i = 0; info->functions[i].name; i++)
 			__bench_mem_function(info, &p, i);
@@ -464,3 +478,85 @@ int bench_mem_memset(int argc, const char **argv)
 
 	return bench_mem_common(argc, argv, &info);
 }
+
+static void map_page_touch(void *dst, size_t size, unsigned int page_shift, bool random)
+{
+	unsigned long npages = size / (1 << page_shift);
+	unsigned long offset = 0, r = 0;
+
+	for (unsigned long i = 0; i < npages; i++) {
+		if (random)
+			r = rand() % (1 << page_shift);
+
+		*((char *)dst + offset + r) = *(char *)(dst + offset + r) + i;
+		offset += 1 << page_shift;
+	}
+}
+
+static int do_map(const struct function *r, struct bench_params *p,
+		  void *src __maybe_unused, void *dst __maybe_unused,
+		  union bench_clock *accum)
+{
+	union bench_clock start, end, diff;
+	map_op_t fn = r->fn.map_op;
+	bool populate = strcmp(r->name, "populate") == 0;
+
+	if (p->seed)
+		srand(p->seed);
+
+	for (unsigned int i = 0; i < p->nr_loops; i++) {
+		clock_get(&start);
+		dst = bench_mmap(p->size, populate, p->page_shift);
+		if (!dst)
+			goto out;
+
+		fn(dst, p->size, p->page_shift, p->seed);
+		clock_get(&end);
+		diff = clock_diff(&start, &end);
+		clock_accum(accum, &diff);
+
+		bench_munmap(dst, p->size);
+	}
+
+	return 0;
+out:
+	printf("# Memory allocation failed - maybe size (%s) %s?\n", size_str,
+			p->page_shift != PAGE_SHIFT_4KB ? "has insufficient hugepages" : "is too large");
+	return -1;
+}
+
+static const char * const bench_mem_map_usage[] = {
+	"perf bench mem map <options>",
+	NULL
+};
+
+static const struct function map_functions[] = {
+	{ .name		= "populate",
+	  .desc		= "Eagerly populated map",
+	  .fn.map_op	= map_page_touch },
+
+	{ .name		= "demand",
+	  .desc		= "Demand loaded map",
+	  .fn.map_op	= map_page_touch },
+
+	{ .name = NULL, }
+};
+
+int bench_mem_map(int argc, const char **argv)
+{
+	static const struct option bench_map_options[] = {
+		OPT_UINTEGER('r', "randomize", &seed,
+			    "Seed to randomize page RW offset with."),
+		OPT_PARENT(bench_common_options),
+		OPT_END()
+	};
+
+	struct bench_mem_info info = {
+		.functions		= map_functions,
+		.do_op			= do_map,
+		.usage			= bench_mem_map_usage,
+		.options		= bench_map_options,
+	};
+
+	return bench_mem_common(argc, argv, &info);
+}
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index 2c1a9f3d847a..a20bd9882f0a 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -65,6 +65,7 @@ static struct bench mem_benchmarks[] = {
 	{ "memcpy",	"Benchmark for memcpy() functions",		bench_mem_memcpy	},
 	{ "memset",	"Benchmark for memset() functions",		bench_mem_memset	},
 	{ "find_bit",	"Benchmark for find_bit() functions",		bench_mem_find_bit	},
+	{ "map",	"Benchmark for mmap() mappings",		bench_mem_map		},
 	{ "all",	"Run all memory access benchmarks",		NULL			},
 	{ NULL,		NULL,						NULL			}
 };
-- 
2.43.5



* [PATCH v5 10/14] x86/mm: Simplify clear_page_*
  2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
                   ` (8 preceding siblings ...)
  2025-07-10  0:59 ` [PATCH v5 09/14] perf bench mem: Add mmap() workloads Ankur Arora
@ 2025-07-10  0:59 ` Ankur Arora
  2025-07-11 11:47   ` David Hildenbrand
  2025-07-10  0:59 ` [PATCH v5 11/14] x86/clear_page: Introduce clear_pages() Ankur Arora
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
variations. Inlining gets rid of an unnecessary CALL/RET (which isn't
free when using RETHUNK speculative execution mitigations).
Fix up and rename clear_page_orig() to adapt to the changed calling
convention.

Also add a comment from Dave Hansen detailing the various clearing
mechanisms used in clear_page().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page_32.h |  6 +++++
 arch/x86/include/asm/page_64.h | 42 ++++++++++++++++++++++++++--------
 arch/x86/lib/clear_page_64.S   | 39 +++++++------------------------
 3 files changed, 46 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index 0c623706cb7e..a8ff43bb9652 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -17,6 +17,12 @@ extern unsigned long __phys_addr(unsigned long);
 
 #include <linux/string.h>
 
+/*
+ * clear_page() - clear kernel page.
+ * @page: address of kernel page
+ *
+ * Does absolutely no exception handling.
+ */
 static inline void clear_page(void *page)
 {
 	memset(page, 0, PAGE_SIZE);
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 015d23f3e01f..28b9adbc5f00 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -40,23 +40,45 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 
 #define __phys_reloc_hide(x)	(x)
 
-void clear_page_orig(void *page);
-void clear_page_rep(void *page);
-void clear_page_erms(void *page);
+void memzero_page_aligned_unrolled(void *addr, u64 len);
 
+/*
+ * clear_page() - clear kernel page.
+ * @page: address of kernel page
+ *
+ * Switch between three implementations of page clearing based on CPU
+ * capabilities:
+ *
+ *  - memzero_page_aligned_unrolled(): the oldest, slowest and universally
+ *    supported method. Zeroes via 8-byte MOV instructions unrolled 8x
+ *    to write a 64-byte cacheline in each loop iteration.
+ *
+ *  - "rep stosq": really old CPUs had crummy REP implementations.
+ *    Vendor CPU setup code sets 'REP_GOOD' on CPUs where REP can be
+ *    trusted. The instruction writes 8 bytes per REP iteration but
+ *    CPUs can internally batch these together and do larger writes.
+ *
+ *  - "rep stosb": CPUs that enumerate 'ERMS' have an improved STOS
+ *    implementation that is less picky about alignment and where
+ *    STOSB (1 byte at a time) is actually faster than STOSQ (8 bytes
+ *    at a time).
+ *
+ * Does absolutely no exception handling.
+ */
 static inline void clear_page(void *page)
 {
+	u64 len = PAGE_SIZE;
 	/*
 	 * Clean up KMSAN metadata for the page being cleared. The assembly call
 	 * below clobbers @page, so we perform unpoisoning before it.
 	 */
-	kmsan_unpoison_memory(page, PAGE_SIZE);
-	alternative_call_2(clear_page_orig,
-			   clear_page_rep, X86_FEATURE_REP_GOOD,
-			   clear_page_erms, X86_FEATURE_ERMS,
-			   "=D" (page),
-			   "D" (page),
-			   "cc", "memory", "rax", "rcx");
+	kmsan_unpoison_memory(page, len);
+	asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
+				   "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
+				   "rep stosb", X86_FEATURE_ERMS)
+			: "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
+			: "a" (0)
+			: "cc", "memory");
 }
 
 void copy_page(void *to, void *from);
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index a508e4a8c66a..27debe0c018c 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -6,30 +6,15 @@
 #include <asm/asm.h>
 
 /*
- * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
- * recommended to use this when possible and we do use them by default.
- * If enhanced REP MOVSB/STOSB is not available, try to use fast string.
- * Otherwise, use original.
+ * Zero page aligned region.
+ * %rdi	- dest
+ * %rcx	- length
  */
-
-/*
- * Zero a page.
- * %rdi	- page
- */
-SYM_TYPED_FUNC_START(clear_page_rep)
-	movl $4096/8,%ecx
-	xorl %eax,%eax
-	rep stosq
-	RET
-SYM_FUNC_END(clear_page_rep)
-EXPORT_SYMBOL_GPL(clear_page_rep)
-
-SYM_TYPED_FUNC_START(clear_page_orig)
-	xorl   %eax,%eax
-	movl   $4096/64,%ecx
+SYM_TYPED_FUNC_START(memzero_page_aligned_unrolled)
+	shrq   $6, %rcx
 	.p2align 4
 .Lloop:
-	decl	%ecx
+	decq	%rcx
 #define PUT(x) movq %rax,x*8(%rdi)
 	movq %rax,(%rdi)
 	PUT(1)
@@ -43,16 +28,8 @@ SYM_TYPED_FUNC_START(clear_page_orig)
 	jnz	.Lloop
 	nop
 	RET
-SYM_FUNC_END(clear_page_orig)
-EXPORT_SYMBOL_GPL(clear_page_orig)
-
-SYM_TYPED_FUNC_START(clear_page_erms)
-	movl $4096,%ecx
-	xorl %eax,%eax
-	rep stosb
-	RET
-SYM_FUNC_END(clear_page_erms)
-EXPORT_SYMBOL_GPL(clear_page_erms)
+SYM_FUNC_END(memzero_page_aligned_unrolled)
+EXPORT_SYMBOL_GPL(memzero_page_aligned_unrolled)
 
 /*
  * Default clear user-space.
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v5 11/14] x86/clear_page: Introduce clear_pages()
  2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
                   ` (9 preceding siblings ...)
  2025-07-10  0:59 ` [PATCH v5 10/14] x86/mm: Simplify clear_page_* Ankur Arora
@ 2025-07-10  0:59 ` Ankur Arora
  2025-07-10  0:59 ` [PATCH v5 12/14] mm: add config option for clearing page-extents Ankur Arora
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Performance when clearing with string instructions (x86-64-stosq and
similar) can vary significantly based on the chunk-size used.

  $ perf bench mem memset -k 4KB -s 4GB -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4GB bytes ...

      13.748208 GB/sec

  $ perf bench mem memset -k 2MB -s 4GB -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in
  # arch/x86/lib/memset_64.S)
  # Copying 4GB bytes ...

      15.067900 GB/sec

  $ perf bench mem memset -k 1GB -s 4GB -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4GB bytes ...

      38.104311 GB/sec

(Both on AMD Milan.)

With a change in chunk-size of 4KB to 1GB, we see the performance go
from 13.7 GB/sec to 38.1 GB/sec. For a chunk-size of 2MB the change isn't
quite as drastic but it is worth adding a clear_page() variant that can
handle contiguous page-extents.
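
As an illustrative sketch (clear_pages() is the variant this patch
introduces below), the difference amounts to going from

	/* page-at-a-time: one REP; STOS invocation per 4KB page */
	for (u64 i = 0; i < npages; i++)
		clear_page(addr + i * PAGE_SIZE);

to exposing the full extent to the processor in a single call:

	/* page-extent: a single REP; STOS over the whole region */
	clear_pages(addr, npages);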

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page_32.h | 15 +++++++++++----
 arch/x86/include/asm/page_64.h | 24 ++++++++++++++++--------
 2 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index a8ff43bb9652..561f416b61e5 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -18,16 +18,23 @@ extern unsigned long __phys_addr(unsigned long);
 #include <linux/string.h>
 
 /*
- * clear_page() - clear kernel page.
- * @page: address of kernel page
+ * clear_pages() - clear kernel page range.
+ * @addr: start address of page range
+ * @npages: number of pages
  *
+ * Assumes that (@addr, +@npages) references a kernel region.
  * Does absolutely no exception handling.
  */
-static inline void clear_page(void *page)
+static inline void clear_pages(void *addr, u64 npages)
 {
-	memset(page, 0, PAGE_SIZE);
+	for (u64 i = 0; i < npages; i++)
+		memset(addr + i * PAGE_SIZE, 0, PAGE_SIZE);
 }
 
+static inline void clear_page(void *addr)
+{
+	clear_pages(addr, 1);
+}
 static inline void copy_page(void *to, void *from)
 {
 	memcpy(to, from, PAGE_SIZE);
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 28b9adbc5f00..5625d616bd00 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -43,8 +43,11 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 void memzero_page_aligned_unrolled(void *addr, u64 len);
 
 /*
- * clear_page() - clear kernel page.
- * @page: address of kernel page
+ * clear_pages() - clear kernel page range.
+ * @addr: start address of page range
+ * @npages: number of pages
+ *
+ * Assumes that (@addr, +@npages) references a kernel region.
  *
  * Switch between three implementations of page clearing based on CPU
  * capabilities:
@@ -65,22 +68,27 @@ void memzero_page_aligned_unrolled(void *addr, u64 len);
  *
  * Does absolutely no exception handling.
  */
-static inline void clear_page(void *page)
+static inline void clear_pages(void *addr, u64 npages)
 {
-	u64 len = PAGE_SIZE;
+	u64 len = npages * PAGE_SIZE;
 	/*
-	 * Clean up KMSAN metadata for the page being cleared. The assembly call
-	 * below clobbers @page, so we perform unpoisoning before it.
+	 * Clean up KMSAN metadata for the pages being cleared. The assembly call
+	 * below clobbers @addr, so we perform unpoisoning before it.
 	 */
-	kmsan_unpoison_memory(page, len);
+	kmsan_unpoison_memory(addr, len);
 	asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
 				   "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
 				   "rep stosb", X86_FEATURE_ERMS)
-			: "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
+			: "+c" (len), "+D" (addr), ASM_CALL_CONSTRAINT
 			: "a" (0)
 			: "cc", "memory");
 }
 
+static inline void clear_page(void *addr)
+{
+	clear_pages(addr, 1);
+}
+
 void copy_page(void *to, void *from);
 KCFI_REFERENCE(copy_page);
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v5 12/14] mm: add config option for clearing page-extents
  2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
                   ` (10 preceding siblings ...)
  2025-07-10  0:59 ` [PATCH v5 11/14] x86/clear_page: Introduce clear_pages() Ankur Arora
@ 2025-07-10  0:59 ` Ankur Arora
  2025-07-10  7:58   ` Andrew Morton
                     ` (2 more replies)
  2025-07-10  0:59 ` [PATCH v5 13/14] mm: memory: support " Ankur Arora
  2025-07-10  0:59 ` [PATCH v5 14/14] x86/clear_pages: Support clearing of page-extents Ankur Arora
  13 siblings, 3 replies; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Add CONFIG_CLEAR_PAGE_EXTENT to allow clearing of page-extents
where architecturally supported.

This is only available with !CONFIG_HIGHMEM because the intent is to
use architecture support to clear contiguous extents in a single
operation (ex. via FEAT_MOPS on arm64, string instructions on x86)
which excludes any possibility of interspersing kmap()/kunmap().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/Kconfig | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 781be3240e21..a74a5e02de28 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -910,6 +910,15 @@ config NO_PAGE_MAPCOUNT
 
 endif # TRANSPARENT_HUGEPAGE
 
+config CLEAR_PAGE_EXTENT
+	def_bool y
+	depends on !HIGHMEM && ARCH_HAS_CLEAR_PAGES
+	depends on TRANSPARENT_HUGEPAGE || HUGETLBFS
+
+	help
+	  Use architectural support for clear_pages() to zero page-extents.
+	  This is likely to be faster than zeroing page-at-a-time.
+
 # simple helper to make the code a bit easier to read
 config PAGE_MAPCOUNT
 	def_bool !NO_PAGE_MAPCOUNT
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v5 13/14] mm: memory: support clearing page-extents
  2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
                   ` (11 preceding siblings ...)
  2025-07-10  0:59 ` [PATCH v5 12/14] mm: add config option for clearing page-extents Ankur Arora
@ 2025-07-10  0:59 ` Ankur Arora
  2025-07-11 11:44   ` David Hildenbrand
  2025-07-15 22:08   ` David Hildenbrand
  2025-07-10  0:59 ` [PATCH v5 14/14] x86/clear_pages: Support clearing of page-extents Ankur Arora
  13 siblings, 2 replies; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

folio_zero_user() is constrained to clear in a page-at-a-time
fashion because it supports CONFIG_HIGHMEM which means that kernel
mappings for pages in a folio are not guaranteed to be contiguous.

We don't have this problem when running under configurations with
CONFIG_CLEAR_PAGE_EXTENT (implies !CONFIG_HIGHMEM), so zero in 
longer page-extents.
This is expected to be faster because the processor can now optimize
the clearing based on the knowledge of the extent.

However, clearing in larger chunks can have two other problems:

 - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
   (larger folios don't have any expectation of cache locality).

 - preemption latency when clearing large folios.

Handle the first by splitting the clearing in three parts: the
faulting page and its immediate locality, its left and right
regions; the local neighbourhood is cleared last.

The second problem is relevant only when running under cooperative
preemption models. Limit the worst case preemption latency by clearing
in architecture specified ARCH_CLEAR_PAGE_EXTENT units.
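
As a worked example of the splitting (following the code below): for a
2MB folio (512 4KB pages), a fault at page index 100, and width=2, the
three regions come out as

	r[2] = [ 98, 102]	/* fault neighbourhood, cleared last */
	r[1] = [  0,  97]	/* left of the fault */
	r[0] = [103, 511]	/* right of the fault */

cleared in the order r[0], r[1], r[2], so the cachelines around the
faulting page are the most recently written.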

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 85 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index b0cda5aab398..c52806270375 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7034,6 +7034,7 @@ static inline int process_huge_page(
 	return 0;
 }
 
+#ifndef CONFIG_CLEAR_PAGE_EXTENT
 static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
 				unsigned int nr_pages)
 {
@@ -7058,7 +7059,10 @@ static int clear_subpage(unsigned long addr, int idx, void *arg)
 /**
  * folio_zero_user - Zero a folio which will be mapped to userspace.
  * @folio: The folio to zero.
- * @addr_hint: The address will be accessed or the base address if uncelar.
+ * @addr_hint: The address accessed by the user or the base address.
+ *
+ * folio_zero_user() uses clear_gigantic_page() or process_huge_page() to
+ * do page-at-a-time zeroing because it needs to handle CONFIG_HIGHMEM.
  */
 void folio_zero_user(struct folio *folio, unsigned long addr_hint)
 {
@@ -7070,6 +7074,86 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
 		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
 }
 
+#else /* CONFIG_CLEAR_PAGE_EXTENT */
+
+static void clear_pages_resched(void *addr, int npages)
+{
+	int i, remaining;
+
+	if (preempt_model_preemptible()) {
+		clear_pages(addr, npages);
+		goto out;
+	}
+
+	for (i = 0; i < npages/ARCH_CLEAR_PAGE_EXTENT; i++) {
+		clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
+			    ARCH_CLEAR_PAGE_EXTENT);
+		cond_resched();
+	}
+
+	remaining = npages % ARCH_CLEAR_PAGE_EXTENT;
+
+	if (remaining)
+		clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
+			    remaining);
+out:
+	cond_resched();
+}
+
+/*
+ * folio_zero_user - Zero a folio which will be mapped to userspace.
+ * @folio: The folio to zero.
+ * @addr_hint: The address accessed by the user or the base address.
+ *
+ * Uses architectural support for clear_pages() to zero page extents
+ * instead of clearing page-at-a-time.
+ *
+ * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
+ * pages in the immediate locality of the faulting page, and its left, right
+ * regions; the local neighbourhood is cleared last in order to keep cache
+ * lines of the target region hot.
+ *
+ * For larger folios we assume that there is no expectation of cache locality
+ * and just do a straight zero.
+ */
+void folio_zero_user(struct folio *folio, unsigned long addr_hint)
+{
+	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
+	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
+	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
+	const int width = 2; /* number of pages cleared last on either side */
+	struct range r[3];
+	int i;
+
+	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
+		clear_pages_resched(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
+		return;
+	}
+
+	/*
+	 * Faulting page and its immediate neighbourhood. Cleared at the end to
+	 * ensure it sticks around in the cache.
+	 */
+	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+	/* Region to the left of the fault */
+	r[1] = DEFINE_RANGE(pg.start,
+			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+			    pg.end);
+
+	for (i = 0; i <= 2; i++) {
+		int npages = range_len(&r[i]);
+
+		if (npages > 0)
+			clear_pages_resched(page_address(folio_page(folio, r[i].start)), npages);
+	}
+}
+#endif /* CONFIG_CLEAR_PAGE_EXTENT */
+
 static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
 				   unsigned long addr_hint,
 				   struct vm_area_struct *vma,
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v5 14/14] x86/clear_pages: Support clearing of page-extents
  2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
                   ` (12 preceding siblings ...)
  2025-07-10  0:59 ` [PATCH v5 13/14] mm: memory: support " Ankur Arora
@ 2025-07-10  0:59 ` Ankur Arora
  13 siblings, 0 replies; 52+ messages in thread
From: Ankur Arora @ 2025-07-10  0:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Define ARCH_HAS_CLEAR_PAGES so hugepage zeroing (via folio_zero_user())
can use clear_pages() to clear in page-extents. This allows the
processor -- when using string instructions (REP; STOS) -- to optimize
based on the extent size.

Also define ARCH_CLEAR_PAGE_EXTENT which is used by folio_zero_user() to
decide the maximum extent to be zeroed when running under cooperative
preemption models.
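
(Worked out: clearing 8MB at ~10GBps takes 8MB / 10GBps ~= 0.8ms, so each
uninterrupted clear_pages() call stays under the ~1ms latency target
before cond_resched() is reached.)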

The resultant performance depends on the kinds of optimizations
available to the uarch for the extent being cleared. Two classes
of optimizations:

  - clearing iteration costs can be amortized over a range larger than
    a single page.
  - cacheline allocation elision (seen only on AMD Zen models).

A demand fault workload shows an improved baseline due to the first
optimization and a larger improvement when the extent is large enough
for the second one.

AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):

 $ perf bench mem map -p $pg-sz -f demand -s 64GB -l 5

                 mm/folio_zero_user    x86/folio_zero_user       change
                  (GB/s  +- %stdev)     (GB/s  +- %stdev)

   pg-sz=2MB       11.82  +- 0.67%        16.48  +-  0.30%       + 39.4%	preempt=*

   pg-sz=1GB       17.14  +- 1.39%        17.42  +-  0.98% [#]   +  1.6%	preempt=none|voluntary
   pg-sz=1GB       17.51  +- 1.19%        43.23  +-  5.22%       +146.8%	preempt=full|lazy

[#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, which is higher than ARCH_CLEAR_PAGE_EXTENT, so
preempt=none|voluntary sees no improvement for pg-sz=1GB.

The improvement due to the hardware eliding cacheline allocation for
pg-sz=1GB can be seen in the reduced L1-dcache-loads:

   - 44,513,459,667      cycles                           #    2.420 GHz                         ( +-  0.44% )  (35.71%)
   -  1,378,032,592      instructions                     #    0.03  insn per cycle
   - 11,224,288,082      L1-dcache-loads                  #  610.187 M/sec                       ( +-  0.08% )  (35.72%)
   -  5,373,473,118      L1-dcache-load-misses            #   47.87% of all L1-dcache accesses   ( +-  0.00% )  (35.71%)

   + 20,093,219,076      cycles                           #    2.421 GHz                         ( +-  3.64% )  (35.69%)
   +  1,378,032,592      instructions                     #    0.03  insn per cycle
   +    186,525,095      L1-dcache-loads                  #   22.479 M/sec                       ( +-  2.11% )  (35.74%)
   +     73,479,687      L1-dcache-load-misses            #   39.39% of all L1-dcache accesses   ( +-  3.03% )  (35.74%)

Also as mentioned earlier, the baseline improvement is not specific to
AMD Zen*. Intel Icelakex (pg-sz=2MB|1GB) sees a similar improvement as
the Milan pg-sz=2MB workload above (~35%).

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/Kconfig               | 4 ++++
 arch/x86/include/asm/page_64.h | 7 +++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 71019b3b54ea..8a7ce6ab229b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -395,6 +395,10 @@ config GENERIC_CALIBRATE_DELAY
 config ARCH_HAS_CPU_RELAX
 	def_bool y
 
+config ARCH_HAS_CLEAR_PAGES
+	def_bool y
+	depends on X86_64 && !HIGHMEM
+
 config ARCH_HIBERNATION_POSSIBLE
 	def_bool y
 
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 5625d616bd00..221c7404fc3a 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -40,6 +40,13 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 
 #define __phys_reloc_hide(x)	(x)
 
+/*
+ * When running under voluntary preemption models, limit the max extent
+ * to 8MB worth of pages. With a clearing BW of ~10GBps, this should result
+ * in a worst case scheduling latency of ~1ms.
+ */
+#define ARCH_CLEAR_PAGE_EXTENT (8 << (20 - PAGE_SHIFT))
+
 void memzero_page_aligned_unrolled(void *addr, u64 len);
 
 /*
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 12/14] mm: add config option for clearing page-extents
  2025-07-10  0:59 ` [PATCH v5 12/14] mm: add config option for clearing page-extents Ankur Arora
@ 2025-07-10  7:58   ` Andrew Morton
  2025-07-10 16:31     ` Ankur Arora
  2025-07-11 11:39   ` David Hildenbrand
  2025-07-11 11:40   ` David Hildenbrand
  2 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2025-07-10  7:58 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, david, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk

On Wed,  9 Jul 2025 17:59:24 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:

> This is only available with !CONFIG_HIGHMEM because the intent is to
> use architecture support to clear contiguous extents in a single
> operation (ex. via FEAT_MOPS on arm64, string instructions on x86)
> which excludes any possibility of interspersing kmap()/kunmap().

I'm feeling dumb.  What does the use of a string instruction have to do
with kmap/kunmap?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 12/14] mm: add config option for clearing page-extents
  2025-07-10  7:58   ` Andrew Morton
@ 2025-07-10 16:31     ` Ankur Arora
  0 siblings, 0 replies; 52+ messages in thread
From: Ankur Arora @ 2025-07-10 16:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, david, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk


Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed,  9 Jul 2025 17:59:24 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> This is only available with !CONFIG_HIGHMEM because the intent is to
>> use architecture support to clear contiguous extents in a single
>> operation (ex. via FEAT_MOPS on arm64, string instructions on x86)
>> which excludes any possibility of interspersing kmap()/kunmap().
>
> I'm feeling dumb.  What does the use of a string instruction have to do
> with kmap/kunmap?

Sorry, that might have been unnecessarily inscrutable. How about
something like this:

 This is only enabled with !CONFIG_HIGHMEM because the intent is
 to use architecture support to clear contiguous extents in a
 single interruptible operation (ex. via FEAT_MOPS on arm64,
 string instructions on x86).

 Given that we might be zeroing the whole extent with a single
 instruction, this excludes any possibility of constructing
 intermediate HIGHMEM maps.
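
To illustrate the constraint (a sketch, not the folio_zero_user() code
itself): under CONFIG_HIGHMEM each page may need its own transient map,

	unsigned long i, npages = folio_nr_pages(folio);

	for (i = 0; i < npages; i++) {
		void *kaddr = kmap_local_page(folio_page(folio, i));

		memset(kaddr, 0, PAGE_SIZE);
		kunmap_local(kaddr);
	}

so there is no single contiguous kernel virtual extent for a single
instruction to cover.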

--
ankur

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 12/14] mm: add config option for clearing page-extents
  2025-07-10  0:59 ` [PATCH v5 12/14] mm: add config option for clearing page-extents Ankur Arora
  2025-07-10  7:58   ` Andrew Morton
@ 2025-07-11 11:39   ` David Hildenbrand
  2025-07-11 17:25     ` Ankur Arora
  2025-07-11 11:40   ` David Hildenbrand
  2 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2025-07-11 11:39 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk

On 10.07.25 02:59, Ankur Arora wrote:
> Add CONFIG_CLEAR_PAGE_EXTENT to allow clearing of page-extents
> where architecturally supported.
> 
> This is only available with !CONFIG_HIGHMEM because the intent is to
> use architecture support to clear contiguous extents in a single
> operation (ex. via FEAT_MOPS on arm64, string instructions on x86)
> which excludes any possibility of interspersing kmap()/kunmap().
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>   mm/Kconfig | 9 +++++++++
>   1 file changed, 9 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 781be3240e21..a74a5e02de28 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -910,6 +910,15 @@ config NO_PAGE_MAPCOUNT
>   
>   endif # TRANSPARENT_HUGEPAGE
>   
> +config CLEAR_PAGE_EXTENT
> +	def_bool y
> +	depends on !HIGHMEM && ARCH_HAS_CLEAR_PAGES
> +	depends on TRANSPARENT_HUGEPAGE || HUGETLBFS
> +
> +	help
> +	  Use architectural support for clear_pages() to zero page-extents.
> +	  This is likely to be faster than zeroing page-at-a-time.
> +

IIRC, adding a help text will make this option be configurable by the 
user, no?

This seems like something we can just keep internal, right?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 12/14] mm: add config option for clearing page-extents
  2025-07-10  0:59 ` [PATCH v5 12/14] mm: add config option for clearing page-extents Ankur Arora
  2025-07-10  7:58   ` Andrew Morton
  2025-07-11 11:39   ` David Hildenbrand
@ 2025-07-11 11:40   ` David Hildenbrand
  2025-07-11 17:32     ` Ankur Arora
  2 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2025-07-11 11:40 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk

On 10.07.25 02:59, Ankur Arora wrote:
> Add CONFIG_CLEAR_PAGE_EXTENT to allow clearing of page-extents
> where architecturally supported.
> 
> This is only available with !CONFIG_HIGHMEM because the intent is to
> use architecture support to clear contiguous extents in a single
> operation (ex. via FEAT_MOPS on arm64, string instructions on x86)
> which excludes any possibility of interspersing kmap()/kunmap().
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---

Staring at the next patch, I think this can easily be squashed into the 
next patch where you add actual MM core support.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 13/14] mm: memory: support clearing page-extents
  2025-07-10  0:59 ` [PATCH v5 13/14] mm: memory: support " Ankur Arora
@ 2025-07-11 11:44   ` David Hildenbrand
  2025-07-11 13:27     ` Raghavendra K T
  2025-07-11 17:39     ` Ankur Arora
  2025-07-15 22:08   ` David Hildenbrand
  1 sibling, 2 replies; 52+ messages in thread
From: David Hildenbrand @ 2025-07-11 11:44 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk

On 10.07.25 02:59, Ankur Arora wrote:
> folio_zero_user() is constrained to clear in a page-at-a-time
> fashion because it supports CONFIG_HIGHMEM which means that kernel
> mappings for pages in a folio are not guaranteed to be contiguous.
> 
> We don't have this problem when running under configurations with
> CONFIG_CLEAR_PAGE_EXTENT (implies !CONFIG_HIGHMEM), so zero in
> longer page-extents.
> This is expected to be faster because the processor can now optimize
> the clearing based on the knowledge of the extent.
> 

I'm curious: how did we end up with "page extent" terminology? :)

"CONFIG_CLEAR_CONTIG_PAGES" or sth. like that would be a bit clearer 
(pun intended), at least to me.

We use "CONTIG" pages already in different MM context.

"page extent" reminds of "page_ext", which is "page extension" ...

> However, clearing in larger chunks can have two other problems:
> 
>   - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>     (larger folios don't have any expectation of cache locality).
> 
>   - preemption latency when clearing large folios.
> 
> Handle the first by splitting the clearing in three parts: the
> faulting page and its immediate locality, its left and right
> regions; the local neighbourhood is cleared last.
> 
> The second problem is relevant only when running under cooperative
> preemption models. Limit the worst case preemption latency by clearing
> in architecture specified ARCH_CLEAR_PAGE_EXTENT units.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 10/14] x86/mm: Simplify clear_page_*
  2025-07-10  0:59 ` [PATCH v5 10/14] x86/mm: Simplify clear_page_* Ankur Arora
@ 2025-07-11 11:47   ` David Hildenbrand
  2025-07-11 17:26     ` Ankur Arora
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2025-07-11 11:47 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk

On 10.07.25 02:59, Ankur Arora wrote:
> clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
> variations. Inlining gets rid of an unnecessary CALL/RET (which isn't
> free when using RETHUNK speculative execution mitigations.)
> Fixup and rename clear_page_orig() to adapt to the changed calling
> convention.
> 
> And, add a comment from Dave Hansen detailing various clearing mechanisms
> used in clear_page().
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>   arch/x86/include/asm/page_32.h |  6 +++++
>   arch/x86/include/asm/page_64.h | 42 ++++++++++++++++++++++++++--------
>   arch/x86/lib/clear_page_64.S   | 39 +++++++------------------------
>   3 files changed, 46 insertions(+), 41 deletions(-)
> 
> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
> index 0c623706cb7e..a8ff43bb9652 100644
> --- a/arch/x86/include/asm/page_32.h
> +++ b/arch/x86/include/asm/page_32.h
> @@ -17,6 +17,12 @@ extern unsigned long __phys_addr(unsigned long);
>   
>   #include <linux/string.h>
>   
> +/*

/** if this was supposed to be kernel doc (which it looks like it is)

> + * clear_page() - clear kernel page.

"clear a kernel page"

Although I am not sure what a "kernel page" is.

Did you mean "clear a page using a kernel virtual address" ?

(we have a handful of "kernel page" usages, where we talk about non-user 
space allocations)

> + * @page: address of kernel page
> + *
> + * Does absolutely no exception handling.
> + */
>   static inline void clear_page(void *page)
>   {
>   	memset(page, 0, PAGE_SIZE);
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index 015d23f3e01f..28b9adbc5f00 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -40,23 +40,45 @@ extern unsigned long __phys_addr_symbol(unsigned long);
>   
>   #define __phys_reloc_hide(x)	(x)
>   
> -void clear_page_orig(void *page);
> -void clear_page_rep(void *page);
> -void clear_page_erms(void *page);
> +void memzero_page_aligned_unrolled(void *addr, u64 len);
>   
> +/*
> + * clear_page() - clear kernel page.
> + * @page: address of kernel page

Same comment as above.



-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 13/14] mm: memory: support clearing page-extents
  2025-07-11 11:44   ` David Hildenbrand
@ 2025-07-11 13:27     ` Raghavendra K T
  2025-07-11 17:39     ` Ankur Arora
  1 sibling, 0 replies; 52+ messages in thread
From: Raghavendra K T @ 2025-07-11 13:27 UTC (permalink / raw)
  To: David Hildenbrand, Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, boris.ostrovsky, konrad.wilk


On 7/11/2025 5:14 PM, David Hildenbrand wrote:
> On 10.07.25 02:59, Ankur Arora wrote:
>> folio_zero_user() is constrained to clear in a page-at-a-time
>> fashion because it supports CONFIG_HIGHMEM which means that kernel
>> mappings for pages in a folio are not guaranteed to be contiguous.
>>
>> We don't have this problem when running under configurations with
>> CONFIG_CLEAR_PAGE_EXTENT (implies !CONFIG_HIGHMEM), so zero in
>> longer page-extents.
>> This is expected to be faster because the processor can now optimize
>> the clearing based on the knowledge of the extent.
>>
> 
> I'm curious: how did we end up with "page extent" terminology? :)
> 
> "CONFIG_CLEAR_CONTIG_PAGES" or sth. like that would be a bit clearer 
> (pun intended), at least to me.
> 

How about CONFIG_CLEAR_MULTI_PAGE ?

> We use "CONTIG" pages already in different MM context.
> 
> "page extent" reminds of "page_ext", which is "page extension" ...
> 
[...]

- Raghu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 12/14] mm: add config option for clearing page-extents
  2025-07-11 11:39   ` David Hildenbrand
@ 2025-07-11 17:25     ` Ankur Arora
  2025-07-11 19:14       ` David Hildenbrand
  0 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-11 17:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk


David Hildenbrand <david@redhat.com> writes:

> On 10.07.25 02:59, Ankur Arora wrote:
>> Add CONFIG_CLEAR_PAGE_EXTENT to allow clearing of page-extents
>> where architecturally supported.
>> This is only available with !CONFIG_HIGHMEM because the intent is to
>> use architecture support to clear contiguous extents in a single
>> operation (ex. via FEAT_MOPS on arm64, string instructions on x86)
>> which excludes any possibility of interspersing kmap()/kunmap().
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>   mm/Kconfig | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 781be3240e21..a74a5e02de28 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -910,6 +910,15 @@ config NO_PAGE_MAPCOUNT
>>     endif # TRANSPARENT_HUGEPAGE
>>   +config CLEAR_PAGE_EXTENT
>> +	def_bool y
>> +	depends on !HIGHMEM && ARCH_HAS_CLEAR_PAGES
>> +	depends on TRANSPARENT_HUGEPAGE || HUGETLBFS
>> +
>> +	help
>> +	  Use architectural support for clear_pages() to zero page-extents.
>> +	  This is likely to be faster than zeroing page-at-a-time.
>> +
>
> IIRC, adding a help text will make this option be configurable by the user, no?

I tried changing the value in the generated .config by hand and that
reverted back to the computed value. So, I think this isn't configurable
by the user.

From a quick look at the kbuild documentation, for this to be user
selectable it needs a prompt of some kind.

That said, the help text doesn't add anything here. Will remove.

> This seems like something we can just keep internal, right?

Agreed.

Thanks for the review!

--
ankur

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 10/14] x86/mm: Simplify clear_page_*
  2025-07-11 11:47   ` David Hildenbrand
@ 2025-07-11 17:26     ` Ankur Arora
  2025-07-11 19:03       ` David Hildenbrand
  0 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-11 17:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk


David Hildenbrand <david@redhat.com> writes:

> On 10.07.25 02:59, Ankur Arora wrote:
>> clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
>> variations. Inlining gets rid of an unnecessary CALL/RET (which isn't
>> free when using RETHUNK speculative execution mitigations.)
>> Fixup and rename clear_page_orig() to adapt to the changed calling
>> convention.
>> And, add a comment from Dave Hansen detailing various clearing mechanisms
>> used in clear_page().
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>   arch/x86/include/asm/page_32.h |  6 +++++
>>   arch/x86/include/asm/page_64.h | 42 ++++++++++++++++++++++++++--------
>>   arch/x86/lib/clear_page_64.S   | 39 +++++++------------------------
>>   3 files changed, 46 insertions(+), 41 deletions(-)
>> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
>> index 0c623706cb7e..a8ff43bb9652 100644
>> --- a/arch/x86/include/asm/page_32.h
>> +++ b/arch/x86/include/asm/page_32.h
>> @@ -17,6 +17,12 @@ extern unsigned long __phys_addr(unsigned long);
>>     #include <linux/string.h>
>>   +/*
>
> /** if this was supposed to be kernel doc (which it looks like it is)
>
>> + * clear_page() - clear kernel page.
>
> "clear a kernel page"
>
> Although I am not sure what a "kernel page" is.
>
> Did you mean "clear a page using a kernel virtual address" ?

Thanks. Yes, this makes way more sense.

> (we have a handful of "kernel page" usages, where we talk about non-user space
> allocations)
>
>> + * @page: address of kernel page
>> + *
>> + * Does absolutely no exception handling.
>> + */
>>   static inline void clear_page(void *page)
>>   {
>>   	memset(page, 0, PAGE_SIZE);
>> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
>> index 015d23f3e01f..28b9adbc5f00 100644
>> --- a/arch/x86/include/asm/page_64.h
>> +++ b/arch/x86/include/asm/page_64.h
>> @@ -40,23 +40,45 @@ extern unsigned long __phys_addr_symbol(unsigned long);
>>     #define __phys_reloc_hide(x)	(x)
>>   -void clear_page_orig(void *page);
>> -void clear_page_rep(void *page);
>> -void clear_page_erms(void *page);
>> +void memzero_page_aligned_unrolled(void *addr, u64 len);
>>   +/*
>> + * clear_page() - clear kernel page.
>> + * @page: address of kernel page
>
> Same comment as above.


--
ankur

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 12/14] mm: add config option for clearing page-extents
  2025-07-11 11:40   ` David Hildenbrand
@ 2025-07-11 17:32     ` Ankur Arora
  2025-07-11 19:26       ` David Hildenbrand
  0 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-11 17:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk


David Hildenbrand <david@redhat.com> writes:

> On 10.07.25 02:59, Ankur Arora wrote:
>> Add CONFIG_CLEAR_PAGE_EXTENT to allow clearing of page-extents
>> where architecturally supported.
>> This is only available with !CONFIG_HIGHMEM because the intent is to
>> use architecture support to clear contiguous extents in a single
>> operation (ex. via FEAT_MOPS on arm64, string instructions on x86)
>> which excludes any possibility of interspersing kmap()/kunmap().
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>
> Staring at the next patch, I think this can easily be squashed into the next
> patch where you add actual MM core support.

I wanted to do this in a separate patch to explicitly document what the
responsibility of the interface provided by the architecture is.
That said, the commit message didn't actually do a good job of doing
that :D.

Copying the more detailed commit message from my reply to Andrew,
one important property of clear_pages() is that it be interruptible
because clear_pages_resched() implicitly depends on it.

> This is only enabled with !CONFIG_HIGHMEM because the intent is
> to use architecture support to clear contiguous extents in a
> single interruptible operation (ex. via FEAT_MOPS on arm64,
> string instructions on x86).

> Given that we might be zeroing the whole extent with a single
> instruction, this excludes any possibility of constructing
> intermediate HIGHMEM maps.

Do you think it is best documented in the next patch in a comment
instead?

--
ankur

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 13/14] mm: memory: support clearing page-extents
  2025-07-11 11:44   ` David Hildenbrand
  2025-07-11 13:27     ` Raghavendra K T
@ 2025-07-11 17:39     ` Ankur Arora
  1 sibling, 0 replies; 52+ messages in thread
From: Ankur Arora @ 2025-07-11 17:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk


David Hildenbrand <david@redhat.com> writes:

> On 10.07.25 02:59, Ankur Arora wrote:
>> folio_zero_user() is constrained to clear in a page-at-a-time
>> fashion because it supports CONFIG_HIGHMEM which means that kernel
>> mappings for pages in a folio are not guaranteed to be contiguous.
>> We don't have this problem when running under configurations with
>> CONFIG_CLEAR_PAGE_EXTENT (implies !CONFIG_HIGHMEM), so zero in
>> longer page-extents.
>> This is expected to be faster because the processor can now optimize
>> the clearing based on the knowledge of the extent.
>>
>
> I'm curious: how did we end up with "page extent" terminology? :)

After trying all the others! I tried multi-page and contig_clear_pages
and finally decided to go with page-extent since clear_pages()
essentially takes a page-extent as an argument.

> "CONFIG_CLEAR_CONTIG_PAGES" or sth. like that would be a bit clearer (pun
> intended), at least to me.

Oddly enough I tried CONFIG_CONTIG_CLEAR_PAGES and
CONFIG_CLEAR_PAGES_CONTIG but those read a little awkwardly.

CONFIG_CLEAR_CONTIG_PAGES reads better. Should have tried moving the
CONTIG to the middle as well :D.

> We use "CONTIG" pages already in different MM context.
>
> "page extent" reminds of "page_ext", which is "page extension" ...

Yeah I see that in the Kconfig now. Will switch to this one.

Thanks

--
ankur

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 10/14] x86/mm: Simplify clear_page_*
  2025-07-11 17:26     ` Ankur Arora
@ 2025-07-11 19:03       ` David Hildenbrand
  2025-07-11 19:24         ` Ankur Arora
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2025-07-11 19:03 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk

On 11.07.25 19:26, Ankur Arora wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 10.07.25 02:59, Ankur Arora wrote:
>>> clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
>>> variations. Inlining gets rid of an unnecessary CALL/RET (which isn't
>>> free when using RETHUNK speculative execution mitigations.)
>>> Fixup and rename clear_page_orig() to adapt to the changed calling
>>> convention.
>>> And, add a comment from Dave Hansen detailing various clearing mechanisms
>>> used in clear_page().
>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>> ---
>>>    arch/x86/include/asm/page_32.h |  6 +++++
>>>    arch/x86/include/asm/page_64.h | 42 ++++++++++++++++++++++++++--------
>>>    arch/x86/lib/clear_page_64.S   | 39 +++++++------------------------
>>>    3 files changed, 46 insertions(+), 41 deletions(-)
>>> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
>>> index 0c623706cb7e..a8ff43bb9652 100644
>>> --- a/arch/x86/include/asm/page_32.h
>>> +++ b/arch/x86/include/asm/page_32.h
>>> @@ -17,6 +17,12 @@ extern unsigned long __phys_addr(unsigned long);
>>>      #include <linux/string.h>
>>>    +/*
>>
>> /** if this was supposed to be kernel doc (which it looks like it is)
>>
>>> + * clear_page() - clear kernel page.
>>
>> "clear a kernel page"
>>
>> Although I am not sure what a "kernel page" is.
>>
>> Did you mean "clear a page using a kernel virtual address" ?
> 
> Thanks. Yes, this makes way more sense.

FWIW, most clear_user_page() will just call clear_page(). The ones that 
don't, have to deal with the dcache flushing, as the page might have 
cache alias from another (mm) address space.

So clear_page() is just about clearing page content using a kernel 
virtual address, and clear_user_page() is additionally taking care of 
any dcache aliases.
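
On x86, for instance, the wrapper is trivial (roughly):

	static inline void clear_user_page(void *page, unsigned long vaddr,
					   struct page *pg)
	{
		clear_page(page);
	}

while the dcache-aliasing architectures flush around the clear.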

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 12/14] mm: add config option for clearing page-extents
  2025-07-11 17:25     ` Ankur Arora
@ 2025-07-11 19:14       ` David Hildenbrand
  2025-07-11 19:35         ` Ankur Arora
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2025-07-11 19:14 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk

On 11.07.25 19:25, Ankur Arora wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 10.07.25 02:59, Ankur Arora wrote:
>>> Add CONFIG_CLEAR_PAGE_EXTENT to allow clearing of page-extents
>>> where architecturally supported.
>>> This is only available with !CONFIG_HIGHMEM because the intent is to
>>> use architecture support to clear contiguous extents in a single
>>> operation (ex. via FEAT_MOPS on arm64, string instructions on x86)
>>> which excludes any possibility of interspersing kmap()/kunmap().
>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>> ---
>>>    mm/Kconfig | 9 +++++++++
>>>    1 file changed, 9 insertions(+)
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 781be3240e21..a74a5e02de28 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -910,6 +910,15 @@ config NO_PAGE_MAPCOUNT
>>>      endif # TRANSPARENT_HUGEPAGE
>>>    +config CLEAR_PAGE_EXTENT
>>> +	def_bool y
>>> +	depends on !HIGHMEM && ARCH_HAS_CLEAR_PAGES
>>> +	depends on TRANSPARENT_HUGEPAGE || HUGETLBFS
>>> +
>>> +	help
>>> +	  Use architectural support for clear_pages() to zero page-extents.
>>> +	  This is likely to be faster than zeroing page-at-a-time.
>>> +
>>
>> IIRC, adding a help text will make this option be configurable by the user, no?
> 
> I tried changing the value in the generated .config by hand and that
> reverted back to the computed value. So, I think this isn't configurable
> by the user.

Yes, I misremembered and the help text confused me.

So yes, the help text in that case (internal entries) is uncommon (no 
need to guide the user when there are no options :) ), but feel free to 
keep it for documentation purposes.

(e.g., HAVE_64BIT_ALIGNED_ACCESS does that)

Maybe we should call it similarly HAVE_... to make it clearer that there 
are really no options.

But I'm bad at #kconfig, so whatever you prefer.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 10/14] x86/mm: Simplify clear_page_*
  2025-07-11 19:03       ` David Hildenbrand
@ 2025-07-11 19:24         ` Ankur Arora
  2025-07-11 19:27           ` David Hildenbrand
  0 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-11 19:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk


David Hildenbrand <david@redhat.com> writes:

> On 11.07.25 19:26, Ankur Arora wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 10.07.25 02:59, Ankur Arora wrote:
>>>> clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
>>>> variations. Inlining gets rid of an unnecessary CALL/RET (which isn't
>>>> free when using RETHUNK speculative execution mitigations.)
>>>> Fixup and rename clear_page_orig() to adapt to the changed calling
>>>> convention.
>>>> And, add a comment from Dave Hansen detailing various clearing mechanisms
>>>> used in clear_page().
>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>> ---
>>>>    arch/x86/include/asm/page_32.h |  6 +++++
>>>>    arch/x86/include/asm/page_64.h | 42 ++++++++++++++++++++++++++--------
>>>>    arch/x86/lib/clear_page_64.S   | 39 +++++++------------------------
>>>>    3 files changed, 46 insertions(+), 41 deletions(-)
>>>> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
>>>> index 0c623706cb7e..a8ff43bb9652 100644
>>>> --- a/arch/x86/include/asm/page_32.h
>>>> +++ b/arch/x86/include/asm/page_32.h
>>>> @@ -17,6 +17,12 @@ extern unsigned long __phys_addr(unsigned long);
>>>>      #include <linux/string.h>
>>>>    +/*
>>>
>>> /** if this was supposed to be kernel doc (which it looks like it is)
>>>
>>>> + * clear_page() - clear kernel page.
>>>
>>> "clear a kernel page"
>>>
>>> Although I am not sure what a "kernel page" is.
>>>
>>> Did you mean "clear a page using a kernel virtual address" ?
>> Thanks. Yes, this makes way more sense.
>
> FWIW, most clear_user_page() will just call clear_page(). The ones that don't,
> have to deal with the dcache flushing, as the page might have cache alias from
> another (mm) address space.
>
> So clear_page() is just about clearing page content using a kernel virtual
> address, and clear_user_page() is additionally taking care of any dcache
> aliases.

Yeah that's a good point.

Right now I'm assuming that clear_user_page() and clear_page() are
effectively identical. Which as you say, isn't true for all the
architectures that can't disambiguate dcache aliases.

So maybe I should be adjusting the naming of the helpers a bit?

Specifically, by calling clear_user_pages() from clear_pages_resched().
Which on x86 etc can just call clear_pages().
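
Something with a hypothetical shape like this (clear_user_pages() doesn't
exist yet; the name and signature are just for illustration):

	/* non-aliasing architectures, e.g. x86 */
	static inline void clear_user_pages(void *addr, unsigned long vaddr,
					    struct page *pg, unsigned int npages)
	{
		clear_pages(addr, npages);
	}

with the dcache-aliasing architectures providing their own version.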

--
ankur

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 12/14] mm: add config option for clearing page-extents
  2025-07-11 17:32     ` Ankur Arora
@ 2025-07-11 19:26       ` David Hildenbrand
  2025-07-11 19:42         ` Ankur Arora
  2025-07-14 20:35         ` Ankur Arora
  0 siblings, 2 replies; 52+ messages in thread
From: David Hildenbrand @ 2025-07-11 19:26 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk

On 11.07.25 19:32, Ankur Arora wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 10.07.25 02:59, Ankur Arora wrote:
>>> Add CONFIG_CLEAR_PAGE_EXTENT to allow clearing of page-extents
>>> where architecturally supported.
>>> This is only available with !CONFIG_HIGHMEM because the intent is to
>>> use architecture support to clear contiguous extents in a single
>>> operation (ex. via FEAT_MOPS on arm64, string instructions on x86)
>>> which excludes any possibility of interspersing kmap()/kunmap().
>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>> ---
>>
>> Staring at the next patch, I think this can easily be squashed into the next
>> patch where you add actual MM core support.
> 
> I wanted to do this in a separate patch to explicitly document what the
> responsibility of the interface provided by the architecture is.
> That said, the commit message didn't actually do a good job of doing
> that :D.
> 
> Copying the more detailed commit message from my reply to Andrew,
> one important property of clear_pages() is that it be interruptible
> because clear_pages_resched() implicitly depends on it.
> 
>> This is only enabled with !CONFIG_HIGHMEM because the intent is
>> to use architecture support to clear contiguous extents in a
>> single interruptible operation (ex. via FEAT_MOPS on arm64,
>> string instructions on x86).
> 
>> Given that we might be zeroing the whole extent with a single
>> instruction, this excludes any possibility of constructing
>> intermediate HIGHMEM maps.
> 
> Do you think it is best documented in the next patch in a comment
> instead?

I would just add + document it as part of the next patch.

Looking at the bigger picture now, you introduce

	ARCH_HAS_CLEAR_PAGES

To say whether an architecture provides clear_pages().

Now we want to conditionally use that to optimize folio_zero_user().

Remind me, why do we want to glue this to THP / HUGETLBFS only? I would 
assume that the code footprint is rather small, and the systems out 
there that are compiled with ARCH_HAS_CLEAR_PAGES but without THP / 
HUGETLBFS are rather ... rare (mostly 32BIT x86 only).

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 10/14] x86/mm: Simplify clear_page_*
  2025-07-11 19:24         ` Ankur Arora
@ 2025-07-11 19:27           ` David Hildenbrand
  0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2025-07-11 19:27 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk

On 11.07.25 21:24, Ankur Arora wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 11.07.25 19:26, Ankur Arora wrote:
>>> David Hildenbrand <david@redhat.com> writes:
>>>
>>>> On 10.07.25 02:59, Ankur Arora wrote:
>>>>> clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
>>>>> variations. Inlining gets rid of an unnecessary CALL/RET (which isn't
>>>>> free when using RETHUNK speculative execution mitigations.)
>>>>> Fixup and rename clear_page_orig() to adapt to the changed calling
>>>>> convention.
>>>>> And, add a comment from Dave Hansen detailing various clearing mechanisms
>>>>> used in clear_page().
>>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>>> ---
>>>>>     arch/x86/include/asm/page_32.h |  6 +++++
>>>>>     arch/x86/include/asm/page_64.h | 42 ++++++++++++++++++++++++++--------
>>>>>     arch/x86/lib/clear_page_64.S   | 39 +++++++------------------------
>>>>>     3 files changed, 46 insertions(+), 41 deletions(-)
>>>>> diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
>>>>> index 0c623706cb7e..a8ff43bb9652 100644
>>>>> --- a/arch/x86/include/asm/page_32.h
>>>>> +++ b/arch/x86/include/asm/page_32.h
>>>>> @@ -17,6 +17,12 @@ extern unsigned long __phys_addr(unsigned long);
>>>>>       #include <linux/string.h>
>>>>>     +/*
>>>>
>>>> /** if this was supposed to be kernel doc (which it looks like it is)
>>>>
>>>>> + * clear_page() - clear kernel page.
>>>>
>>>> "clear a kernel page"
>>>>
>>>> Although I am not sure what a "kernel page" is.
>>>>
>>>> Did you mean "clear a page using a kernel virtual address" ?
>>> Thanks. Yes, this makes way more sense.
>>
>> FWIW, most clear_user_page() will just call clear_page(). The ones that don't,
>> have to deal with the dcache flushing, as the page might have cache alias from
>> another (mm) address space.
>>
>> So clear_page() is just about clearing page content using a kernel virtual
>> address, and clear_user_page() is additionally taking care of any dcache
>> aliases.
> 
> Yeah that's a good point.
> 
> Right now I'm assuming that clear_user_page() and clear_page() are
> effectively identical. Which as you say, isn't true for all the
> architectures that can't disambiguate dcache aliases.
> 
> So maybe I should be adjusting the naming of the helpers a bit?
> 
> Specifically, by calling clear_user_pages() from clear_pages_resched().
> Which on x86 etc can just call clear_pages().

Right, and calling it from folio_zero_user() :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 12/14] mm: add config option for clearing page-extents
  2025-07-11 19:14       ` David Hildenbrand
@ 2025-07-11 19:35         ` Ankur Arora
  0 siblings, 0 replies; 52+ messages in thread
From: Ankur Arora @ 2025-07-11 19:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk


David Hildenbrand <david@redhat.com> writes:

> On 11.07.25 19:25, Ankur Arora wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 10.07.25 02:59, Ankur Arora wrote:
>>>> Add CONFIG_CLEAR_PAGE_EXTENT to allow clearing of page-extents
>>>> where architecturally supported.
>>>> This is only available with !CONFIG_HIGHMEM because the intent is to
>>>> use architecture support to clear contiguous extents in a single
>>>> operation (ex. via FEAT_MOPS on arm64, string instructions on x86)
>>>> which excludes any possibility of interspersing kmap()/kunmap().
>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>> ---
>>>>    mm/Kconfig | 9 +++++++++
>>>>    1 file changed, 9 insertions(+)
>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>> index 781be3240e21..a74a5e02de28 100644
>>>> --- a/mm/Kconfig
>>>> +++ b/mm/Kconfig
>>>> @@ -910,6 +910,15 @@ config NO_PAGE_MAPCOUNT
>>>>      endif # TRANSPARENT_HUGEPAGE
>>>>    +config CLEAR_PAGE_EXTENT
>>>> +	def_bool y
>>>> +	depends on !HIGHMEM && ARCH_HAS_CLEAR_PAGES
>>>> +	depends on TRANSPARENT_HUGEPAGE || HUGETLBFS
>>>> +
>>>> +	help
>>>> +	  Use architectural support for clear_pages() to zero page-extents.
>>>> +	  This is likely to be faster than zeroing page-at-a-time.
>>>> +
>>>
>>> IIRC, adding a help text will make this option be configurable by the user, no?
>> I tried changing the value in the generated .config by hand and that
>> reverted back to the computed value. So, I think this isn't configurable
>> by the user.
>
> Yes, I misremembered and the help text confused me.
>
> So yes, the help text in that case (internal entries) is uncommon (no need to
> guide the user when there are no options :) ), but feel free to keep it for
> documentation purposes.
>
> (e.g., HAVE_64BIT_ALIGNED_ACCESS does that)
>
> Maybe we should call it similarly HAVE_... to make it clearer that there are
> really no options.

Actually a prefix is a good idea.

Btw, the natural follow-up (to me anyway) to this series would be to
also do this for the copy path.

From some quick tests on x86, the performance improvement was similar.

So, maybe it could just be something like: USE_CONTIG_PAGES.

But that needs a verb somewhere.
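
(Purely illustrative, an internal entry in the HAVE_ style could look
like:

    config HAVE_PAGE_EXTENT_OPS
            def_bool y
            depends on !HIGHMEM && ARCH_HAS_CLEAR_PAGES

with no prompt, so there is nothing for the user to toggle.)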

> But I'm bad at #kconfig, so whatever you prefer.

Not much better, as you can see.

--
ankur

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 12/14] mm: add config option for clearing page-extents
  2025-07-11 19:26       ` David Hildenbrand
@ 2025-07-11 19:42         ` Ankur Arora
  2025-07-14 20:35         ` Ankur Arora
  1 sibling, 0 replies; 52+ messages in thread
From: Ankur Arora @ 2025-07-11 19:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk


David Hildenbrand <david@redhat.com> writes:

> On 11.07.25 19:32, Ankur Arora wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 10.07.25 02:59, Ankur Arora wrote:
>>>> Add CONFIG_CLEAR_PAGE_EXTENT to allow clearing of page-extents
>>>> where architecturally supported.
>>>> This is only available with !CONFIG_HIGHMEM because the intent is to
>>>> use architecture support to clear contiguous extents in a single
>>>> operation (ex. via FEAT_MOPS on arm64, string instructions on x86)
>>>> which excludes any possibility of interspersing kmap()/kunmap().
>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>> ---
>>>
>>> Staring at the next patch, I think this can easily be squashed into the next
>>> patch where you add actual MM core support.
>> I wanted to do this in a separate patch to explicitly document what the
>> responsibility of the interface provided by the architecture is.
>> That said, the commit message didn't actually do a good job of doing
>> that :D.
>> Copying the more detailed commit message from my reply to Andrew,
>> one important part of the clear_pages() interface is that it be interruptible
>> because clear_pages_resched() implicitly depends on it.
>>
>>> This is only enabled with !CONFIG_HIGHMEM because the intent is
>>> to use architecture support to clear contiguous extents in a
>>> single interruptible operation (ex. via FEAT_MOPS on arm64,
>>> string instructions on x86).
>>
>>> Given that we might be zeroing the whole extent with a single
>>> instruction, this excludes any possibility of constructing
>>> intermediate HIGHMEM maps.
>> Do you think it is best documented in the next patch in a comment
>> instead?
>
> I would just add + document it as part of the next patch.
>
> Looking at the bigger picture now, you introduce
>
> 	ARCH_HAS_CLEAR_PAGES
>
> To say whether an architecture provides clear_pages().
>
> Now we want to conditionally use that to optimize folio_zero_user().
>
> Remind me, why do we want to glue this to THP / HUGETLBFS only? I would assume
> that the code footprint is rather small, and the systems out there that are
> compiled with ARCH_HAS_CLEAR_PAGES but without THP / HUGETLBFS are rather ...
> rare (mostly 32BIT x86 only).

A failure of imagination, I guess. I couldn't think of a case where
anyone would need to operate on contiguous pages without also having
THP or HUGETLBFS.

And, for the cases where we want to operate on kernel ranges, we could
directly use clear_pages().
(Though that needs some work to add a generic wrapper.)
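
(Roughly, assuming clear_pages() takes a kernel virtual address and a
page count, the generic wrapper could be something like:

    #ifndef clear_pages
    static inline void clear_pages(void *addr, unsigned int npages)
    {
            for (unsigned int i = 0; i < npages; i++)
                    clear_page(addr + i * PAGE_SIZE);
    }
    #endif

with architectures that select ARCH_HAS_CLEAR_PAGES providing their
own.)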

--
ankur

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 12/14] mm: add config option for clearing page-extents
  2025-07-11 19:26       ` David Hildenbrand
  2025-07-11 19:42         ` Ankur Arora
@ 2025-07-14 20:35         ` Ankur Arora
  2025-07-15 20:59           ` David Hildenbrand
  1 sibling, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-14 20:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, ziy


[ Added Zi Yan. ]

David Hildenbrand <david@redhat.com> writes:

> On 11.07.25 19:32, Ankur Arora wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 10.07.25 02:59, Ankur Arora wrote:
>>>> Add CONFIG_CLEAR_PAGE_EXTENT to allow clearing of page-extents
>>>> where architecturally supported.
>>>> This is only available with !CONFIG_HIGHMEM because the intent is to
>>>> use architecture support to clear contiguous extents in a single
>>>> operation (ex. via FEAT_MOPS on arm64, string instructions on x86)
>>>> which excludes any possibility of interspersing kmap()/kunmap().
>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>> ---
>>>
>>> Staring at the next patch, I think this can easily be squashed into the next
>>> patch where you add actual MM core support.
>> I wanted to do this in a separate patch to explicitly document what the
>> responsibility of the interface provided by the architecture is.
>> That said, the commit message didn't actually do a good job of doing
>> that :D.
>> Copying the more detailed commit message from my reply to Andrew,
>> one important part of the clear_pages() interface is that it be interruptible
>> because clear_pages_resched() implicitly depends on it.
>>
>>> This is only enabled with !CONFIG_HIGHMEM because the intent is
>>> to use architecture support to clear contiguous extents in a
>>> single interruptible operation (ex. via FEAT_MOPS on arm64,
>>> string instructions on x86).
>>
>>> Given that we might be zeroing the whole extent with a single
>>> instruction, this excludes any possibility of constructing
>>> intermediate HIGHMEM maps.
>> Do you think it is best documented in the next patch in a comment
>> instead?
>
> I would just add + document it as part of the next patch.
>
> Looking at the bigger picture now, you introduce
>
> 	ARCH_HAS_CLEAR_PAGES
>
> To say whether an architecture provides clear_pages().
>
> Now we want to conditionally use that to optimize folio_zero_user().
>
> Remind me, why do we want to glue this to THP / HUGETLBFS only? I would assume
> that the code footprint is rather small, and the systems out there that are
> compiled with ARCH_HAS_CLEAR_PAGES but without THP / HUGETLBFS are rather ...
> rare (mostly 32BIT x86 only).

I thought about this some more and there are a few other interfaces that
end up clearing pages:

> clear_highpage()
> clear_highpage_kasan_tagged()
> tag_clear_highpage()

In this set, there are many loops of the form:

   for (i = 0; i < n; i++)
       clear_highpage(page + i);

At least some of these (including kernel_init_pages()) could be migrated
to variations on a clear_highpages() which could be:

    static inline void clear_highpages(struct page *page, u32 num_pages)
    {
            if (!IS_ENABLED(CONFIG_HIGHMEM))
                    clear_pages_resched(page, num_pages);
            else
                    for (u32 i = 0; i < num_pages; ++i)
                            clear_highpage(page + i);
    }

(clear_pages_resched() should be safe to use from here because
everybody using this should be in a schedulable context.)
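
For reference, a rough sketch of the shape assumed here for
clear_pages_resched(), clearing in bounded chunks with explicit
rescheduling points (hypothetical; the chunk units and the
preempt=full|lazy handling are per the actual series):

    static void clear_pages_resched(struct page *page, unsigned int npages)
    {
            unsigned int chunk = ARCH_CLEAR_PAGE_EXTENT;    /* assumed cap */

            while (npages) {
                    unsigned int n = min(npages, chunk);

                    clear_pages(page_address(page), n);
                    page += n;
                    npages -= n;
                    cond_resched();
            }
    }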

(Handling kernel_init_pages() was also suggested by Zi Yan in a review of v3 [1].)

> clear_user_highpage()

The only users are folio_zero_user(), __collapse_huge_page_copy() and
userfaultfd.

> clear_user_page()
Not many users apart from the highmem interface.

> clear_page()

Not many users apart from the highmem interface.

I'm happy to do this work, just not sure how to stage it. In particular I
would like to avoid a series which tries to address all of the cases.

Maybe it makes sense to just add the clear_highpages() variants, the
folio_zero_user() handling, and some of the obvious users of
clear_highpages() for v6?


Thanks
Ankur

[1] https://lore.kernel.org/lkml/AC2C5344-E655-45BB-B90B-D63C4AC8F2F6@nvidia.com/

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 01/14] perf bench mem: Remove repetition around time measurement
  2025-07-10  0:59 ` [PATCH v5 01/14] perf bench mem: Remove repetition around time measurement Ankur Arora
@ 2025-07-15 20:04   ` Namhyung Kim
  0 siblings, 0 replies; 52+ messages in thread
From: Namhyung Kim @ 2025-07-15 20:04 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, david, bp, dave.hansen, hpa,
	mingo, mjguzik, luto, peterz, acme, tglx, willy, raghavendra.kt,
	boris.ostrovsky, konrad.wilk

Hello,

On Wed, Jul 09, 2025 at 05:59:13PM -0700, Ankur Arora wrote:
> We have two copies of each mem benchmark: one using cycles to
> measure time, the second for gettimeofday().
> 
> Unify.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

Reviewed-by: Namhyung Kim <namhyung@kernel.org>

Thanks,
Namhyung

> ---
>  tools/perf/bench/mem-functions.c | 110 +++++++++++++------------------
>  1 file changed, 46 insertions(+), 64 deletions(-)
> 
> diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
> index 19d45c377ac1..8599ed96ee1f 100644
> --- a/tools/perf/bench/mem-functions.c
> +++ b/tools/perf/bench/mem-functions.c
> @@ -51,6 +51,11 @@ static const struct option options[] = {
>  	OPT_END()
>  };
>  
> +union bench_clock {
> +	u64		cycles;
> +	struct timeval	tv;
> +};
> +
>  typedef void *(*memcpy_t)(void *, const void *, size_t);
>  typedef void *(*memset_t)(void *, int, size_t);
>  
> @@ -91,6 +96,26 @@ static u64 get_cycles(void)
>  	return clk;
>  }
>  
> +static void clock_get(union bench_clock *t)
> +{
> +	if (use_cycles)
> +		t->cycles = get_cycles();
> +	else
> +		BUG_ON(gettimeofday(&t->tv, NULL));
> +}
> +
> +static union bench_clock clock_diff(union bench_clock *s, union bench_clock *e)
> +{
> +	union bench_clock t;
> +
> +	if (use_cycles)
> +		t.cycles = e->cycles - s->cycles;
> +	else
> +		timersub(&e->tv, &s->tv, &t.tv);
> +
> +	return t;
> +}
> +
>  static double timeval2double(struct timeval *ts)
>  {
>  	return (double)ts->tv_sec + (double)ts->tv_usec / (double)USEC_PER_SEC;
> @@ -109,8 +134,7 @@ static double timeval2double(struct timeval *ts)
>  
>  struct bench_mem_info {
>  	const struct function *functions;
> -	u64 (*do_cycles)(const struct function *r, size_t size, void *src, void *dst);
> -	double (*do_gettimeofday)(const struct function *r, size_t size, void *src, void *dst);
> +	union bench_clock (*do_op)(const struct function *r, size_t size, void *src, void *dst);
>  	const char *const *usage;
>  	bool alloc_src;
>  };
> @@ -119,7 +143,7 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
>  {
>  	const struct function *r = &info->functions[r_idx];
>  	double result_bps = 0.0;
> -	u64 result_cycles = 0;
> +	union bench_clock rt = { 0 };
>  	void *src = NULL, *dst = zalloc(size);
>  
>  	printf("# function '%s' (%s)\n", r->name, r->desc);
> @@ -136,25 +160,23 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
>  	if (bench_format == BENCH_FORMAT_DEFAULT)
>  		printf("# Copying %s bytes ...\n\n", size_str);
>  
> -	if (use_cycles) {
> -		result_cycles = info->do_cycles(r, size, src, dst);
> -	} else {
> -		result_bps = info->do_gettimeofday(r, size, src, dst);
> -	}
> +	rt = info->do_op(r, size, src, dst);
>  
>  	switch (bench_format) {
>  	case BENCH_FORMAT_DEFAULT:
>  		if (use_cycles) {
> -			printf(" %14lf cycles/byte\n", (double)result_cycles/size_total);
> +			printf(" %14lf cycles/byte\n", (double)rt.cycles/size_total);
>  		} else {
> +			result_bps = size_total/timeval2double(&rt.tv);
>  			print_bps(result_bps);
>  		}
>  		break;
>  
>  	case BENCH_FORMAT_SIMPLE:
>  		if (use_cycles) {
> -			printf("%lf\n", (double)result_cycles/size_total);
> +			printf("%lf\n", (double)rt.cycles/size_total);
>  		} else {
> +			result_bps = size_total/timeval2double(&rt.tv);
>  			printf("%lf\n", result_bps);
>  		}
>  		break;
> @@ -235,38 +257,21 @@ static void memcpy_prefault(memcpy_t fn, size_t size, void *src, void *dst)
>  	fn(dst, src, size);
>  }
>  
> -static u64 do_memcpy_cycles(const struct function *r, size_t size, void *src, void *dst)
> +static union bench_clock do_memcpy(const struct function *r, size_t size,
> +				   void *src, void *dst)
>  {
> -	u64 cycle_start = 0ULL, cycle_end = 0ULL;
> +	union bench_clock start, end;
>  	memcpy_t fn = r->fn.memcpy;
>  	int i;
>  
>  	memcpy_prefault(fn, size, src, dst);
>  
> -	cycle_start = get_cycles();
> +	clock_get(&start);
>  	for (i = 0; i < nr_loops; ++i)
>  		fn(dst, src, size);
> -	cycle_end = get_cycles();
> +	clock_get(&end);
>  
> -	return cycle_end - cycle_start;
> -}
> -
> -static double do_memcpy_gettimeofday(const struct function *r, size_t size, void *src, void *dst)
> -{
> -	struct timeval tv_start, tv_end, tv_diff;
> -	memcpy_t fn = r->fn.memcpy;
> -	int i;
> -
> -	memcpy_prefault(fn, size, src, dst);
> -
> -	BUG_ON(gettimeofday(&tv_start, NULL));
> -	for (i = 0; i < nr_loops; ++i)
> -		fn(dst, src, size);
> -	BUG_ON(gettimeofday(&tv_end, NULL));
> -
> -	timersub(&tv_end, &tv_start, &tv_diff);
> -
> -	return (double)(((double)size * nr_loops) / timeval2double(&tv_diff));
> +	return clock_diff(&start, &end);
>  }
>  
>  struct function memcpy_functions[] = {
> @@ -292,8 +297,7 @@ int bench_mem_memcpy(int argc, const char **argv)
>  {
>  	struct bench_mem_info info = {
>  		.functions		= memcpy_functions,
> -		.do_cycles		= do_memcpy_cycles,
> -		.do_gettimeofday	= do_memcpy_gettimeofday,
> +		.do_op			= do_memcpy,
>  		.usage			= bench_mem_memcpy_usage,
>  		.alloc_src              = true,
>  	};
> @@ -301,9 +305,10 @@ int bench_mem_memcpy(int argc, const char **argv)
>  	return bench_mem_common(argc, argv, &info);
>  }
>  
> -static u64 do_memset_cycles(const struct function *r, size_t size, void *src __maybe_unused, void *dst)
> +static union bench_clock do_memset(const struct function *r, size_t size,
> +				   void *src __maybe_unused, void *dst)
>  {
> -	u64 cycle_start = 0ULL, cycle_end = 0ULL;
> +	union bench_clock start, end;
>  	memset_t fn = r->fn.memset;
>  	int i;
>  
> @@ -313,34 +318,12 @@ static u64 do_memset_cycles(const struct function *r, size_t size, void *src __m
>  	 */
>  	fn(dst, -1, size);
>  
> -	cycle_start = get_cycles();
> +	clock_get(&start);
>  	for (i = 0; i < nr_loops; ++i)
>  		fn(dst, i, size);
> -	cycle_end = get_cycles();
> +	clock_get(&end);
>  
> -	return cycle_end - cycle_start;
> -}
> -
> -static double do_memset_gettimeofday(const struct function *r, size_t size, void *src __maybe_unused, void *dst)
> -{
> -	struct timeval tv_start, tv_end, tv_diff;
> -	memset_t fn = r->fn.memset;
> -	int i;
> -
> -	/*
> -	 * We prefault the freshly allocated memory range here,
> -	 * to not measure page fault overhead:
> -	 */
> -	fn(dst, -1, size);
> -
> -	BUG_ON(gettimeofday(&tv_start, NULL));
> -	for (i = 0; i < nr_loops; ++i)
> -		fn(dst, i, size);
> -	BUG_ON(gettimeofday(&tv_end, NULL));
> -
> -	timersub(&tv_end, &tv_start, &tv_diff);
> -
> -	return (double)(((double)size * nr_loops) / timeval2double(&tv_diff));
> +	return clock_diff(&start, &end);
>  }
>  
>  static const char * const bench_mem_memset_usage[] = {
> @@ -366,8 +349,7 @@ int bench_mem_memset(int argc, const char **argv)
>  {
>  	struct bench_mem_info info = {
>  		.functions		= memset_functions,
> -		.do_cycles		= do_memset_cycles,
> -		.do_gettimeofday	= do_memset_gettimeofday,
> +		.do_op			= do_memset,
>  		.usage			= bench_mem_memset_usage,
>  	};
>  
> -- 
> 2.43.5
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 02/14] perf bench mem: Defer type munging of size to float
  2025-07-10  0:59 ` [PATCH v5 02/14] perf bench mem: Defer type munging of size to float Ankur Arora
@ 2025-07-15 20:05   ` Namhyung Kim
  2025-07-16  2:17     ` Ankur Arora
  0 siblings, 1 reply; 52+ messages in thread
From: Namhyung Kim @ 2025-07-15 20:05 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, david, bp, dave.hansen, hpa,
	mingo, mjguzik, luto, peterz, acme, tglx, willy, raghavendra.kt,
	boris.ostrovsky, konrad.wilk

On Wed, Jul 09, 2025 at 05:59:14PM -0700, Ankur Arora wrote:
> Do type conversion to double at the point of use.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

Reviewed-by: Namhyung Kim <namhyung@kernel.org>

A nitpick below.

> ---
>  tools/perf/bench/mem-functions.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
> index 8599ed96ee1f..b8f020379197 100644
> --- a/tools/perf/bench/mem-functions.c
> +++ b/tools/perf/bench/mem-functions.c
> @@ -139,7 +139,7 @@ struct bench_mem_info {
>  	bool alloc_src;
>  };
>  
> -static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t size, double size_total)
> +static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t size, size_t size_total)
>  {
>  	const struct function *r = &info->functions[r_idx];
>  	double result_bps = 0.0;
> @@ -165,18 +165,18 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
>  	switch (bench_format) {
>  	case BENCH_FORMAT_DEFAULT:
>  		if (use_cycles) {
> -			printf(" %14lf cycles/byte\n", (double)rt.cycles/size_total);
> +			printf(" %14lf cycles/byte\n", (double)rt.cycles/(double)size_total);
>  		} else {
> -			result_bps = size_total/timeval2double(&rt.tv);
> +			result_bps = (double)size_total/timeval2double(&rt.tv);
>  			print_bps(result_bps);
>  		}
>  		break;
>  
>  	case BENCH_FORMAT_SIMPLE:
>  		if (use_cycles) {
> -			printf("%lf\n", (double)rt.cycles/size_total);
> +			printf("%lf\n", (double)rt.cycles/(double)size_total);
>  		} else {
> -			result_bps = size_total/timeval2double(&rt.tv);
> +			result_bps = (double)size_total/timeval2double(&rt.tv);
>  			printf("%lf\n", result_bps);
>  		}
>  		break;
> @@ -199,7 +199,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
>  {
>  	int i;
>  	size_t size;
> -	double size_total;
> +	size_t size_total;
>  
>  	argc = parse_options(argc, argv, options, info->usage, 0);
>  
> @@ -212,7 +212,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
>  	}
>  
>  	size = (size_t)perf_atoll((char *)size_str);
> -	size_total = (double)size * nr_loops;
> +	size_total = (size_t)size * nr_loops;

No need to cast.

Thanks,
Namhyung

>  
>  	if ((s64)size <= 0) {
>  		fprintf(stderr, "Invalid size:%s\n", size_str);
> -- 
> 2.43.5
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 03/14] perf bench mem: Move mem op parameters into a structure
  2025-07-10  0:59 ` [PATCH v5 03/14] perf bench mem: Move mem op parameters into a structure Ankur Arora
@ 2025-07-15 20:06   ` Namhyung Kim
  0 siblings, 0 replies; 52+ messages in thread
From: Namhyung Kim @ 2025-07-15 20:06 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, david, bp, dave.hansen, hpa,
	mingo, mjguzik, luto, peterz, acme, tglx, willy, raghavendra.kt,
	boris.ostrovsky, konrad.wilk

On Wed, Jul 09, 2025 at 05:59:15PM -0700, Ankur Arora wrote:
> Move benchmark function parameters in struct bench_params.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

Reviewed-by: Namhyung Kim <namhyung@kernel.org>

Thanks,
Namhyung

> ---
>  tools/perf/bench/mem-functions.c | 63 +++++++++++++++++---------------
>  1 file changed, 34 insertions(+), 29 deletions(-)
> 
> diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
> index b8f020379197..fb17d36a6f6c 100644
> --- a/tools/perf/bench/mem-functions.c
> +++ b/tools/perf/bench/mem-functions.c
> @@ -30,7 +30,7 @@
>  
>  static const char	*size_str	= "1MB";
>  static const char	*function_str	= "all";
> -static int		nr_loops	= 1;
> +static unsigned int	nr_loops	= 1;
>  static bool		use_cycles;
>  static int		cycles_fd;
>  
> @@ -42,7 +42,7 @@ static const struct option options[] = {
>  	OPT_STRING('f', "function", &function_str, "all",
>  		    "Specify the function to run, \"all\" runs all available functions, \"help\" lists them"),
>  
> -	OPT_INTEGER('l', "nr_loops", &nr_loops,
> +	OPT_UINTEGER('l', "nr_loops", &nr_loops,
>  		    "Specify the number of loops to run. (default: 1)"),
>  
>  	OPT_BOOLEAN('c', "cycles", &use_cycles,
> @@ -56,6 +56,12 @@ union bench_clock {
>  	struct timeval	tv;
>  };
>  
> +struct bench_params {
> +	size_t		size;
> +	size_t		size_total;
> +	unsigned int	nr_loops;
> +};
> +
>  typedef void *(*memcpy_t)(void *, const void *, size_t);
>  typedef void *(*memset_t)(void *, int, size_t);
>  
> @@ -134,17 +140,19 @@ static double timeval2double(struct timeval *ts)
>  
>  struct bench_mem_info {
>  	const struct function *functions;
> -	union bench_clock (*do_op)(const struct function *r, size_t size, void *src, void *dst);
> +	union bench_clock (*do_op)(const struct function *r, struct bench_params *p,
> +				   void *src, void *dst);
>  	const char *const *usage;
>  	bool alloc_src;
>  };
>  
> -static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t size, size_t size_total)
> +static void __bench_mem_function(struct bench_mem_info *info, struct bench_params *p,
> +				 int r_idx)
>  {
>  	const struct function *r = &info->functions[r_idx];
>  	double result_bps = 0.0;
>  	union bench_clock rt = { 0 };
> -	void *src = NULL, *dst = zalloc(size);
> +	void *src = NULL, *dst = zalloc(p->size);
>  
>  	printf("# function '%s' (%s)\n", r->name, r->desc);
>  
> @@ -152,7 +160,7 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
>  		goto out_alloc_failed;
>  
>  	if (info->alloc_src) {
> -		src = zalloc(size);
> +		src = zalloc(p->size);
>  		if (src == NULL)
>  			goto out_alloc_failed;
>  	}
> @@ -160,23 +168,23 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
>  	if (bench_format == BENCH_FORMAT_DEFAULT)
>  		printf("# Copying %s bytes ...\n\n", size_str);
>  
> -	rt = info->do_op(r, size, src, dst);
> +	rt = info->do_op(r, p, src, dst);
>  
>  	switch (bench_format) {
>  	case BENCH_FORMAT_DEFAULT:
>  		if (use_cycles) {
> -			printf(" %14lf cycles/byte\n", (double)rt.cycles/(double)size_total);
> +			printf(" %14lf cycles/byte\n", (double)rt.cycles/(double)p->size_total);
>  		} else {
> -			result_bps = (double)size_total/timeval2double(&rt.tv);
> +			result_bps = (double)p->size_total/timeval2double(&rt.tv);
>  			print_bps(result_bps);
>  		}
>  		break;
>  
>  	case BENCH_FORMAT_SIMPLE:
>  		if (use_cycles) {
> -			printf("%lf\n", (double)rt.cycles/(double)size_total);
> +			printf("%lf\n", (double)rt.cycles/(double)p->size_total);
>  		} else {
> -			result_bps = (double)size_total/timeval2double(&rt.tv);
> +			result_bps = (double)p->size_total/timeval2double(&rt.tv);
>  			printf("%lf\n", result_bps);
>  		}
>  		break;
> @@ -198,8 +206,7 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
>  static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *info)
>  {
>  	int i;
> -	size_t size;
> -	size_t size_total;
> +	struct bench_params p = { 0 };
>  
>  	argc = parse_options(argc, argv, options, info->usage, 0);
>  
> @@ -211,17 +218,17 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
>  		}
>  	}
>  
> -	size = (size_t)perf_atoll((char *)size_str);
> -	size_total = (size_t)size * nr_loops;
> -
> -	if ((s64)size <= 0) {
> +	p.nr_loops = nr_loops;
> +	p.size = (size_t)perf_atoll((char *)size_str);
> +	if ((s64)p.size <= 0) {
>  		fprintf(stderr, "Invalid size:%s\n", size_str);
>  		return 1;
>  	}
> +	p.size_total = (size_t)p.size * p.nr_loops;
>  
>  	if (!strncmp(function_str, "all", 3)) {
>  		for (i = 0; info->functions[i].name; i++)
> -			__bench_mem_function(info, i, size, size_total);
> +			__bench_mem_function(info, &p, i);
>  		return 0;
>  	}
>  
> @@ -240,7 +247,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
>  		return 1;
>  	}
>  
> -	__bench_mem_function(info, i, size, size_total);
> +	__bench_mem_function(info, &p, i);
>  
>  	return 0;
>  }
> @@ -257,18 +264,17 @@ static void memcpy_prefault(memcpy_t fn, size_t size, void *src, void *dst)
>  	fn(dst, src, size);
>  }
>  
> -static union bench_clock do_memcpy(const struct function *r, size_t size,
> +static union bench_clock do_memcpy(const struct function *r, struct bench_params *p,
>  				   void *src, void *dst)
>  {
>  	union bench_clock start, end;
>  	memcpy_t fn = r->fn.memcpy;
> -	int i;
>  
> -	memcpy_prefault(fn, size, src, dst);
> +	memcpy_prefault(fn, p->size, src, dst);
>  
>  	clock_get(&start);
> -	for (i = 0; i < nr_loops; ++i)
> -		fn(dst, src, size);
> +	for (unsigned int i = 0; i < p->nr_loops; ++i)
> +		fn(dst, src, p->size);
>  	clock_get(&end);
>  
>  	return clock_diff(&start, &end);
> @@ -305,22 +311,21 @@ int bench_mem_memcpy(int argc, const char **argv)
>  	return bench_mem_common(argc, argv, &info);
>  }
>  
> -static union bench_clock do_memset(const struct function *r, size_t size,
> +static union bench_clock do_memset(const struct function *r, struct bench_params *p,
>  				   void *src __maybe_unused, void *dst)
>  {
>  	union bench_clock start, end;
>  	memset_t fn = r->fn.memset;
> -	int i;
>  
>  	/*
>  	 * We prefault the freshly allocated memory range here,
>  	 * to not measure page fault overhead:
>  	 */
> -	fn(dst, -1, size);
> +	fn(dst, -1, p->size);
>  
>  	clock_get(&start);
> -	for (i = 0; i < nr_loops; ++i)
> -		fn(dst, i, size);
> +	for (unsigned int i = 0; i < p->nr_loops; ++i)
> +		fn(dst, i, p->size);
>  	clock_get(&end);
>  
>  	return clock_diff(&start, &end);
> -- 
> 2.43.5
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 04/14] perf bench mem: Pull out init/fini logic
  2025-07-10  0:59 ` [PATCH v5 04/14] perf bench mem: Pull out init/fini logic Ankur Arora
@ 2025-07-15 20:09   ` Namhyung Kim
  0 siblings, 0 replies; 52+ messages in thread
From: Namhyung Kim @ 2025-07-15 20:09 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, david, bp, dave.hansen, hpa,
	mingo, mjguzik, luto, peterz, acme, tglx, willy, raghavendra.kt,
	boris.ostrovsky, konrad.wilk

On Wed, Jul 09, 2025 at 05:59:16PM -0700, Ankur Arora wrote:
> No functional change.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

Reviewed-by: Namhyung Kim <namhyung@kernel.org>

A nitpick below.

> ---
>  tools/perf/bench/mem-functions.c             | 103 +++++++++++++------
>  tools/perf/bench/mem-memcpy-arch.h           |   2 +-
>  tools/perf/bench/mem-memcpy-x86-64-asm-def.h |   4 +
>  tools/perf/bench/mem-memset-arch.h           |   2 +-
>  tools/perf/bench/mem-memset-x86-64-asm-def.h |   4 +
>  5 files changed, 81 insertions(+), 34 deletions(-)
> 
> diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
> index fb17d36a6f6c..06d3ee6f5d69 100644
> --- a/tools/perf/bench/mem-functions.c
> +++ b/tools/perf/bench/mem-functions.c
> @@ -62,15 +62,31 @@ struct bench_params {
>  	unsigned int	nr_loops;
>  };
>  
> +struct bench_mem_info {
> +	const struct function *functions;
> +	int (*do_op)(const struct function *r, struct bench_params *p,
> +		     void *src, void *dst, union bench_clock *rt);
> +	const char *const *usage;
> +	bool alloc_src;
> +};
> +
> +typedef bool (*mem_init_t)(struct bench_mem_info *, struct bench_params *,
> +			   void **, void **);
> +typedef void (*mem_fini_t)(struct bench_mem_info *, struct bench_params *,
> +			   void **, void **);
>  typedef void *(*memcpy_t)(void *, const void *, size_t);
>  typedef void *(*memset_t)(void *, int, size_t);
>  
>  struct function {
>  	const char *name;
>  	const char *desc;
> -	union {
> -		memcpy_t memcpy;
> -		memset_t memset;
> +	struct {
> +		mem_init_t init;
> +		mem_fini_t fini;
> +		union {
> +			memcpy_t memcpy;
> +			memset_t memset;
> +		};
>  	} fn;
>  };
>  
> @@ -138,37 +154,24 @@ static double timeval2double(struct timeval *ts)
>  			printf(" %14lf GB/sec\n", x / K / K / K);	\
>  	} while (0)
>  
> -struct bench_mem_info {
> -	const struct function *functions;
> -	union bench_clock (*do_op)(const struct function *r, struct bench_params *p,
> -				   void *src, void *dst);
> -	const char *const *usage;
> -	bool alloc_src;
> -};
> -
>  static void __bench_mem_function(struct bench_mem_info *info, struct bench_params *p,
>  				 int r_idx)
>  {
>  	const struct function *r = &info->functions[r_idx];
>  	double result_bps = 0.0;
>  	union bench_clock rt = { 0 };
> -	void *src = NULL, *dst = zalloc(p->size);
> +	void *src = NULL, *dst = NULL;
>  
>  	printf("# function '%s' (%s)\n", r->name, r->desc);
>  
> -	if (dst == NULL)
> -		goto out_alloc_failed;
> -
> -	if (info->alloc_src) {
> -		src = zalloc(p->size);
> -		if (src == NULL)
> -			goto out_alloc_failed;
> -	}
> +	if (r->fn.init && r->fn.init(info, p, &src, &dst))
> +		goto out_init_failed;
>  
>  	if (bench_format == BENCH_FORMAT_DEFAULT)
>  		printf("# Copying %s bytes ...\n\n", size_str);
>  
> -	rt = info->do_op(r, p, src, dst);
> +	if (info->do_op(r, p, src, dst, &rt))
> +		goto out_test_failed;
>  
>  	switch (bench_format) {
>  	case BENCH_FORMAT_DEFAULT:
> @@ -194,11 +197,11 @@ static void __bench_mem_function(struct bench_mem_info *info, struct bench_param
>  		break;
>  	}
>  
> +out_test_failed:
>  out_free:
> -	free(src);
> -	free(dst);
> +	if (r->fn.fini) r->fn.fini(info, p, &src, &dst);
>  	return;
> -out_alloc_failed:
> +out_init_failed:
>  	printf("# Memory allocation failed - maybe size (%s) is too large?\n", size_str);
>  	goto out_free;
>  }
> @@ -264,8 +267,8 @@ static void memcpy_prefault(memcpy_t fn, size_t size, void *src, void *dst)
>  	fn(dst, src, size);
>  }
>  
> -static union bench_clock do_memcpy(const struct function *r, struct bench_params *p,
> -				   void *src, void *dst)
> +static int do_memcpy(const struct function *r, struct bench_params *p,
> +		     void *src, void *dst, union bench_clock *rt)
>  {
>  	union bench_clock start, end;
>  	memcpy_t fn = r->fn.memcpy;
> @@ -277,16 +280,47 @@ static union bench_clock do_memcpy(const struct function *r, struct bench_params
>  		fn(dst, src, p->size);
>  	clock_get(&end);
>  
> -	return clock_diff(&start, &end);
> +	*rt = clock_diff(&start, &end);
> +
> +	return 0;
> +}
> +
> +static bool mem_alloc(struct bench_mem_info *info, struct bench_params *p,
> +		      void **src, void **dst)
> +{
> +	bool failed;
> +
> +	*dst = zalloc(p->size);
> +	failed = *dst == NULL;
> +
> +	if (info->alloc_src) {
> +		*src = zalloc(p->size);
> +		failed = failed || *src == NULL;
> +	}
> +
> +	return failed;
> +}
> +
> +static void mem_free(struct bench_mem_info *info __maybe_unused,
> +		     struct bench_params *p __maybe_unused,
> +		     void **src, void **dst)
> +{
> +	free(*dst);
> +	free(*src);
> +
> +	*dst = *src = NULL;

There's zfree() to handle free and reset together.  But probably not
needed as you want to convert it to mmap later.
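
For example (sketch):

	zfree(&dst);	/* free + NULL the pointer */
	zfree(&src);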

Thanks,
Namhyung

>  }
>  
>  struct function memcpy_functions[] = {
>  	{ .name		= "default",
>  	  .desc		= "Default memcpy() provided by glibc",
> +	  .fn.init	= mem_alloc,
> +	  .fn.fini	= mem_free,
>  	  .fn.memcpy	= memcpy },
>  
>  #ifdef HAVE_ARCH_X86_64_SUPPORT
> -# define MEMCPY_FN(_fn, _name, _desc) {.name = _name, .desc = _desc, .fn.memcpy = _fn},
> +# define MEMCPY_FN(_fn, _init, _fini, _name, _desc)	\
> +	{.name = _name, .desc = _desc, .fn.memcpy = _fn, .fn.init = _init, .fn.fini = _fini },
>  # include "mem-memcpy-x86-64-asm-def.h"
>  # undef MEMCPY_FN
>  #endif
> @@ -311,8 +345,8 @@ int bench_mem_memcpy(int argc, const char **argv)
>  	return bench_mem_common(argc, argv, &info);
>  }
>  
> -static union bench_clock do_memset(const struct function *r, struct bench_params *p,
> -				   void *src __maybe_unused, void *dst)
> +static int do_memset(const struct function *r, struct bench_params *p,
> +		     void *src __maybe_unused, void *dst, union bench_clock *rt)
>  {
>  	union bench_clock start, end;
>  	memset_t fn = r->fn.memset;
> @@ -328,7 +362,9 @@ static union bench_clock do_memset(const struct function *r, struct bench_params
>  		fn(dst, i, p->size);
>  	clock_get(&end);
>  
> -	return clock_diff(&start, &end);
> +	*rt = clock_diff(&start, &end);
> +
> +	return 0;
>  }
>  
>  static const char * const bench_mem_memset_usage[] = {
> @@ -339,10 +375,13 @@ static const char * const bench_mem_memset_usage[] = {
>  static const struct function memset_functions[] = {
>  	{ .name		= "default",
>  	  .desc		= "Default memset() provided by glibc",
> +	  .fn.init	= mem_alloc,
> +	  .fn.fini	= mem_free,
>  	  .fn.memset	= memset },
>  
>  #ifdef HAVE_ARCH_X86_64_SUPPORT
> -# define MEMSET_FN(_fn, _name, _desc) { .name = _name, .desc = _desc, .fn.memset = _fn },
> +# define MEMSET_FN(_fn, _init, _fini, _name, _desc) \
> +	{.name = _name, .desc = _desc, .fn.memset = _fn, .fn.init = _init, .fn.fini = _fini },
>  # include "mem-memset-x86-64-asm-def.h"
>  # undef MEMSET_FN
>  #endif
> diff --git a/tools/perf/bench/mem-memcpy-arch.h b/tools/perf/bench/mem-memcpy-arch.h
> index 5bcaec5601a8..852e48cfd8fe 100644
> --- a/tools/perf/bench/mem-memcpy-arch.h
> +++ b/tools/perf/bench/mem-memcpy-arch.h
> @@ -2,7 +2,7 @@
>  
>  #ifdef HAVE_ARCH_X86_64_SUPPORT
>  
> -#define MEMCPY_FN(fn, name, desc)		\
> +#define MEMCPY_FN(fn, init, fini, name, desc)		\
>  	void *fn(void *, const void *, size_t);
>  
>  #include "mem-memcpy-x86-64-asm-def.h"
> diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm-def.h b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
> index 6188e19d3129..f43038f4448b 100644
> --- a/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
> +++ b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
> @@ -1,9 +1,13 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  
>  MEMCPY_FN(memcpy_orig,
> +	mem_alloc,
> +	mem_free,
>  	"x86-64-unrolled",
>  	"unrolled memcpy() in arch/x86/lib/memcpy_64.S")
>  
>  MEMCPY_FN(__memcpy,
> +	mem_alloc,
> +	mem_free,
>  	"x86-64-movsq",
>  	"movsq-based memcpy() in arch/x86/lib/memcpy_64.S")
> diff --git a/tools/perf/bench/mem-memset-arch.h b/tools/perf/bench/mem-memset-arch.h
> index 53f45482663f..278c5da12d63 100644
> --- a/tools/perf/bench/mem-memset-arch.h
> +++ b/tools/perf/bench/mem-memset-arch.h
> @@ -2,7 +2,7 @@
>  
>  #ifdef HAVE_ARCH_X86_64_SUPPORT
>  
> -#define MEMSET_FN(fn, name, desc)		\
> +#define MEMSET_FN(fn, init, fini, name, desc)	\
>  	void *fn(void *, int, size_t);
>  
>  #include "mem-memset-x86-64-asm-def.h"
> diff --git a/tools/perf/bench/mem-memset-x86-64-asm-def.h b/tools/perf/bench/mem-memset-x86-64-asm-def.h
> index 247c72fdfb9d..80ad1b7ea770 100644
> --- a/tools/perf/bench/mem-memset-x86-64-asm-def.h
> +++ b/tools/perf/bench/mem-memset-x86-64-asm-def.h
> @@ -1,9 +1,13 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  
>  MEMSET_FN(memset_orig,
> +	mem_alloc,
> +	mem_free,
>  	"x86-64-unrolled",
>  	"unrolled memset() in arch/x86/lib/memset_64.S")
>  
>  MEMSET_FN(__memset,
> +	mem_alloc,
> +	mem_free,
>  	"x86-64-stosq",
>  	"movsq-based memset() in arch/x86/lib/memset_64.S")
> -- 
> 2.43.5
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 05/14] perf bench mem: Switch from zalloc() to mmap()
  2025-07-10  0:59 ` [PATCH v5 05/14] perf bench mem: Switch from zalloc() to mmap() Ankur Arora
@ 2025-07-15 20:09   ` Namhyung Kim
  0 siblings, 0 replies; 52+ messages in thread
From: Namhyung Kim @ 2025-07-15 20:09 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, david, bp, dave.hansen, hpa,
	mingo, mjguzik, luto, peterz, acme, tglx, willy, raghavendra.kt,
	boris.ostrovsky, konrad.wilk

On Wed, Jul 09, 2025 at 05:59:17PM -0700, Ankur Arora wrote:
> Using mmap() ensures that the buffer is always aligned at a fixed
> boundary. Switch to that to remove one source of variability.
> 
> Since we always want to read/write from the allocated buffers, map
> with pagetables pre-populated.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

Reviewed-by: Namhyung Kim <namhyung@kernel.org>

Thanks,
Namhyung

> ---
>  tools/perf/bench/mem-functions.c | 27 ++++++++++++++++++++++-----
>  1 file changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
> index 06d3ee6f5d69..914f9048d982 100644
> --- a/tools/perf/bench/mem-functions.c
> +++ b/tools/perf/bench/mem-functions.c
> @@ -22,9 +22,9 @@
>  #include <string.h>
>  #include <unistd.h>
>  #include <sys/time.h>
> +#include <sys/mman.h>
>  #include <errno.h>
>  #include <linux/time64.h>
> -#include <linux/zalloc.h>
>  
>  #define K 1024
>  
> @@ -285,16 +285,33 @@ static int do_memcpy(const struct function *r, struct bench_params *p,
>  	return 0;
>  }
>  
> +static void *bench_mmap(size_t size, bool populate)
> +{
> +	void *p;
> +	int extra = populate ? MAP_POPULATE : 0;
> +
> +	p = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +		 extra | MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
> +
> +	return p == MAP_FAILED ? NULL : p;
> +}
> +
> +static void bench_munmap(void *p, size_t size)
> +{
> +	if (p)
> +		munmap(p, size);
> +}
> +
>  static bool mem_alloc(struct bench_mem_info *info, struct bench_params *p,
>  		      void **src, void **dst)
>  {
>  	bool failed;
>  
> -	*dst = zalloc(p->size);
> +	*dst = bench_mmap(p->size, true);
>  	failed = *dst == NULL;
>  
>  	if (info->alloc_src) {
> -		*src = zalloc(p->size);
> +		*src = bench_mmap(p->size, true);
>  		failed = failed || *src == NULL;
>  	}
>  
> @@ -305,8 +322,8 @@ static void mem_free(struct bench_mem_info *info __maybe_unused,
>  		     struct bench_params *p __maybe_unused,
>  		     void **src, void **dst)
>  {
> -	free(*dst);
> -	free(*src);
> +	bench_munmap(*dst, p->size);
> +	bench_munmap(*src, p->size);
>  
>  	*dst = *src = NULL;
>  }
> -- 
> 2.43.5
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 06/14] perf bench mem: Allow mapping of hugepages
  2025-07-10  0:59 ` [PATCH v5 06/14] perf bench mem: Allow mapping of hugepages Ankur Arora
@ 2025-07-15 20:12   ` Namhyung Kim
  2025-07-16  2:32     ` Ankur Arora
  0 siblings, 1 reply; 52+ messages in thread
From: Namhyung Kim @ 2025-07-15 20:12 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, david, bp, dave.hansen, hpa,
	mingo, mjguzik, luto, peterz, acme, tglx, willy, raghavendra.kt,
	boris.ostrovsky, konrad.wilk

On Wed, Jul 09, 2025 at 05:59:18PM -0700, Ankur Arora wrote:
> Page sizes that can be selected: 4KB, 2MB, 1GB.
> 
> Both the reservation and node from which hugepages are allocated
> from are expected to be addressed by the user.

It'd be nice if you could provide an example in the commit message when
you introduce a new command line option.  Also please update
Documentation/perf-bench.txt.
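
For example, something like (hypothetical invocation, assuming
hugepages have been reserved):

  $ perf bench mem memset -s 4gb -p 2mb -l 10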

Thanks,
Namhyung

> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  tools/perf/bench/mem-functions.c | 33 ++++++++++++++++++++++++++++----
>  1 file changed, 29 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
> index 914f9048d982..e4d713587d45 100644
> --- a/tools/perf/bench/mem-functions.c
> +++ b/tools/perf/bench/mem-functions.c
> @@ -25,11 +25,17 @@
>  #include <sys/mman.h>
>  #include <errno.h>
>  #include <linux/time64.h>
> +#include <linux/log2.h>
>  
>  #define K 1024
>  
> +#define PAGE_SHIFT_4KB		12
> +#define PAGE_SHIFT_2MB		21
> +#define PAGE_SHIFT_1GB		30
> +
>  static const char	*size_str	= "1MB";
>  static const char	*function_str	= "all";
> +static const char	*page_size_str	= "4KB";
>  static unsigned int	nr_loops	= 1;
>  static bool		use_cycles;
>  static int		cycles_fd;
> @@ -39,6 +45,10 @@ static const struct option options[] = {
>  		    "Specify the size of the memory buffers. "
>  		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
>  
> +	OPT_STRING('p', "page", &page_size_str, "4KB",
> +		    "Specify page-size for mapping memory buffers. "
> +		    "Available sizes: 4KB, 2MB, 1GB (case insensitive)"),
> +
>  	OPT_STRING('f', "function", &function_str, "all",
>  		    "Specify the function to run, \"all\" runs all available functions, \"help\" lists them"),
>  
> @@ -60,6 +70,7 @@ struct bench_params {
>  	size_t		size;
>  	size_t		size_total;
>  	unsigned int	nr_loops;
> +	unsigned int	page_shift;
>  };
>  
>  struct bench_mem_info {
> @@ -202,7 +213,8 @@ static void __bench_mem_function(struct bench_mem_info *info, struct bench_param
>  	if (r->fn.fini) r->fn.fini(info, p, &src, &dst);
>  	return;
>  out_init_failed:
> -	printf("# Memory allocation failed - maybe size (%s) is too large?\n", size_str);
> +	printf("# Memory allocation failed - maybe size (%s) %s?\n", size_str,
> +			p->page_shift != PAGE_SHIFT_4KB ? "has insufficient hugepages" : "is too large");
>  	goto out_free;
>  }
>  
> @@ -210,6 +222,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
>  {
>  	int i;
>  	struct bench_params p = { 0 };
> +	unsigned int page_size;
>  
>  	argc = parse_options(argc, argv, options, info->usage, 0);
>  
> @@ -229,6 +242,15 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
>  	}
>  	p.size_total = (size_t)p.size * p.nr_loops;
>  
> +	page_size = (unsigned int)perf_atoll((char *)page_size_str);
> +	if (page_size != (1 << PAGE_SHIFT_4KB) &&
> +	    page_size != (1 << PAGE_SHIFT_2MB) &&
> +	    page_size != (1 << PAGE_SHIFT_1GB)) {
> +		fprintf(stderr, "Invalid page-size:%s\n", page_size_str);
> +		return 1;
> +	}
> +	p.page_shift = ilog2(page_size);
> +
>  	if (!strncmp(function_str, "all", 3)) {
>  		for (i = 0; info->functions[i].name; i++)
>  			__bench_mem_function(info, &p, i);
> @@ -285,11 +307,14 @@ static int do_memcpy(const struct function *r, struct bench_params *p,
>  	return 0;
>  }
>  
> -static void *bench_mmap(size_t size, bool populate)
> +static void *bench_mmap(size_t size, bool populate, unsigned int page_shift)
>  {
>  	void *p;
>  	int extra = populate ? MAP_POPULATE : 0;
>  
> +	if (page_shift != PAGE_SHIFT_4KB)
> +		extra |= MAP_HUGETLB | (page_shift << MAP_HUGE_SHIFT);
> +
>  	p = mmap(NULL, size, PROT_READ|PROT_WRITE,
>  		 extra | MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
>  
> @@ -307,11 +332,11 @@ static bool mem_alloc(struct bench_mem_info *info, struct bench_params *p,
>  {
>  	bool failed;
>  
> -	*dst = bench_mmap(p->size, true);
> +	*dst = bench_mmap(p->size, true, p->page_shift);
>  	failed = *dst == NULL;
>  
>  	if (info->alloc_src) {
> -		*src = bench_mmap(p->size, true);
> +		*src = bench_mmap(p->size, true, p->page_shift);
>  		failed = failed || *src == NULL;
>  	}
>  
> -- 
> 2.43.5
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 07/14] perf bench mem: Allow chunking on a memory region
  2025-07-10  0:59 ` [PATCH v5 07/14] perf bench mem: Allow chunking on a memory region Ankur Arora
@ 2025-07-15 20:17   ` Namhyung Kim
  2025-07-16  2:34     ` Ankur Arora
  0 siblings, 1 reply; 52+ messages in thread
From: Namhyung Kim @ 2025-07-15 20:17 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, david, bp, dave.hansen, hpa,
	mingo, mjguzik, luto, peterz, acme, tglx, willy, raghavendra.kt,
	boris.ostrovsky, konrad.wilk

On Wed, Jul 09, 2025 at 05:59:19PM -0700, Ankur Arora wrote:
> There can be a significant gap in memset/memcpy performance depending
> on the size of the region being operated on.
> 
> With chunk-size=4kb:
> 
>   $ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
> 
>   $ perf bench mem memset -p 4kb -k 4kb -s 4gb -l 10 -f x86-64-stosq
>   # Running 'mem/memset' benchmark:
>   # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
>   # Copying 4gb bytes ...
> 
>       13.011655 GB/sec
> 
> With chunk-size=1gb:
> 
>   $ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
> 
>   $ perf bench mem memset -p 4kb -k 1gb -s 4gb -l 10 -f x86-64-stosq
>   # Running 'mem/memset' benchmark:
>   # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
>   # Copying 4gb bytes ...
> 
>       21.936355 GB/sec
> 
> So, allow the user to specify the chunk-size.
> 
> The default value is identical to the total size of the region, which
> preserves current behaviour.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

Again, please update the documentation.  With that,

Reviewed-by: Namhyung Kim <namhyung@kernel.org>

Thanks,
Namhyung

> ---
>  tools/perf/bench/mem-functions.c | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
> index e4d713587d45..412d18f2cb2e 100644
> --- a/tools/perf/bench/mem-functions.c
> +++ b/tools/perf/bench/mem-functions.c
> @@ -36,6 +36,7 @@
>  static const char	*size_str	= "1MB";
>  static const char	*function_str	= "all";
>  static const char	*page_size_str	= "4KB";
> +static const char	*chunk_size_str	= "0";
>  static unsigned int	nr_loops	= 1;
>  static bool		use_cycles;
>  static int		cycles_fd;
> @@ -49,6 +50,10 @@ static const struct option options[] = {
>  		    "Specify page-size for mapping memory buffers. "
>  		    "Available sizes: 4KB, 2MB, 1GB (case insensitive)"),
>  
> +	OPT_STRING('k', "chunk", &chunk_size_str, "0",
> +		    "Specify the chunk-size for each invocation. "
> +		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
> +
>  	OPT_STRING('f', "function", &function_str, "all",
>  		    "Specify the function to run, \"all\" runs all available functions, \"help\" lists them"),
>  
> @@ -69,6 +74,7 @@ union bench_clock {
>  struct bench_params {
>  	size_t		size;
>  	size_t		size_total;
> +	size_t		chunk_size;
>  	unsigned int	nr_loops;
>  	unsigned int	page_shift;
>  };
> @@ -242,6 +248,14 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
>  	}
>  	p.size_total = (size_t)p.size * p.nr_loops;
>  
> +	p.chunk_size = (size_t)perf_atoll((char *)chunk_size_str);
> +	if ((s64)p.chunk_size < 0 || (s64)p.chunk_size > (s64)p.size) {
> +		fprintf(stderr, "Invalid chunk_size:%s\n", chunk_size_str);
> +		return 1;
> +	}
> +	if (!p.chunk_size)
> +		p.chunk_size = p.size;
> +
>  	page_size = (unsigned int)perf_atoll((char *)page_size_str);
>  	if (page_size != (1 << PAGE_SHIFT_4KB) &&
>  	    page_size != (1 << PAGE_SHIFT_2MB) &&
> @@ -299,7 +313,8 @@ static int do_memcpy(const struct function *r, struct bench_params *p,
>  
>  	clock_get(&start);
>  	for (unsigned int i = 0; i < p->nr_loops; ++i)
> -		fn(dst, src, p->size);
> +		for (size_t off = 0; off < p->size; off += p->chunk_size)
> +			fn(dst + off, src + off, min(p->chunk_size, p->size - off));
>  	clock_get(&end);
>  
>  	*rt = clock_diff(&start, &end);
> @@ -401,7 +416,8 @@ static int do_memset(const struct function *r, struct bench_params *p,
>  
>  	clock_get(&start);
>  	for (unsigned int i = 0; i < p->nr_loops; ++i)
> -		fn(dst, i, p->size);
> +		for (size_t off = 0; off < p->size; off += p->chunk_size)
> +			fn(dst + off, i, min(p->chunk_size, p->size - off));
>  	clock_get(&end);
>  
>  	*rt = clock_diff(&start, &end);
> -- 
> 2.43.5
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 08/14] perf bench mem: Refactor mem_options
  2025-07-10  0:59 ` [PATCH v5 08/14] perf bench mem: Refactor mem_options Ankur Arora
@ 2025-07-15 20:18   ` Namhyung Kim
  0 siblings, 0 replies; 52+ messages in thread
From: Namhyung Kim @ 2025-07-15 20:18 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, david, bp, dave.hansen, hpa,
	mingo, mjguzik, luto, peterz, acme, tglx, willy, raghavendra.kt,
	boris.ostrovsky, konrad.wilk

On Wed, Jul 09, 2025 at 05:59:20PM -0700, Ankur Arora wrote:
> Split mem benchmark options into common and memset/memcpy specific.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

Reviewed-by: Namhyung Kim <namhyung@kernel.org>

Thanks,
Namhyung

> ---
>  tools/perf/bench/mem-functions.c | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
> index 412d18f2cb2e..8a37da149327 100644
> --- a/tools/perf/bench/mem-functions.c
> +++ b/tools/perf/bench/mem-functions.c
> @@ -41,7 +41,7 @@ static unsigned int	nr_loops	= 1;
>  static bool		use_cycles;
>  static int		cycles_fd;
>  
> -static const struct option options[] = {
> +static const struct option bench_common_options[] = {
>  	OPT_STRING('s', "size", &size_str, "1MB",
>  		    "Specify the size of the memory buffers. "
>  		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
> @@ -50,10 +50,6 @@ static const struct option options[] = {
>  		    "Specify page-size for mapping memory buffers. "
>  		    "Available sizes: 4KB, 2MB, 1GB (case insensitive)"),
>  
> -	OPT_STRING('k', "chunk", &chunk_size_str, "0",
> -		    "Specify the chunk-size for each invocation. "
> -		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
> -
>  	OPT_STRING('f', "function", &function_str, "all",
>  		    "Specify the function to run, \"all\" runs all available functions, \"help\" lists them"),
>  
> @@ -66,6 +62,14 @@ static const struct option options[] = {
>  	OPT_END()
>  };
>  
> +static const struct option bench_mem_options[] = {
> +	OPT_STRING('k', "chunk", &chunk_size_str, "0",
> +		    "Specify the chunk-size for each invocation. "
> +		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
> +	OPT_PARENT(bench_common_options),
> +	OPT_END()
> +};
> +
>  union bench_clock {
>  	u64		cycles;
>  	struct timeval	tv;
> @@ -84,6 +88,7 @@ struct bench_mem_info {
>  	int (*do_op)(const struct function *r, struct bench_params *p,
>  		     void *src, void *dst, union bench_clock *rt);
>  	const char *const *usage;
> +	const struct option *options;
>  	bool alloc_src;
>  };
>  
> @@ -230,7 +235,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
>  	struct bench_params p = { 0 };
>  	unsigned int page_size;
>  
> -	argc = parse_options(argc, argv, options, info->usage, 0);
> +	argc = parse_options(argc, argv, info->options, info->usage, 0);
>  
>  	if (use_cycles) {
>  		i = init_cycles();
> @@ -396,6 +401,7 @@ int bench_mem_memcpy(int argc, const char **argv)
>  		.functions		= memcpy_functions,
>  		.do_op			= do_memcpy,
>  		.usage			= bench_mem_memcpy_usage,
> +		.options		= bench_mem_options,
>  		.alloc_src              = true,
>  	};
>  
> @@ -453,6 +459,7 @@ int bench_mem_memset(int argc, const char **argv)
>  		.functions		= memset_functions,
>  		.do_op			= do_memset,
>  		.usage			= bench_mem_memset_usage,
> +		.options		= bench_mem_options,
>  	};
>  
>  	return bench_mem_common(argc, argv, &info);
> -- 
> 2.43.5
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 09/14] perf bench mem: Add mmap() workloads
  2025-07-10  0:59 ` [PATCH v5 09/14] perf bench mem: Add mmap() workloads Ankur Arora
@ 2025-07-15 20:20   ` Namhyung Kim
  2025-07-16  2:40     ` Ankur Arora
  0 siblings, 1 reply; 52+ messages in thread
From: Namhyung Kim @ 2025-07-15 20:20 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, david, bp, dave.hansen, hpa,
	mingo, mjguzik, luto, peterz, acme, tglx, willy, raghavendra.kt,
	boris.ostrovsky, konrad.wilk

On Wed, Jul 09, 2025 at 05:59:21PM -0700, Ankur Arora wrote:
> Add two mmap() workloads: one that eagerly populates a region and
> another that demand faults it in.
> 
> The intent is to probe the memory subsystem overhead incurred
> by mmap().

Maybe better to name it 'mmap' as other tests are named after the
actual function.  Also please update the documentation.

Thanks,
Namhyung


> 
>   $ perf bench mem map -s 4gb -p 4kb -l 10 -f populate
>   # Running 'mem/map' benchmark:
>   # function 'populate' (Eagerly populated map)
>   # Copying 4gb bytes ...
> 
>        1.811691 GB/sec
> 
>   $ perf bench mem map -s 4gb -p 2mb -l 10 -f populate
>   # Running 'mem/map' benchmark:
>   # function 'populate' (Eagerly populated map)
>   # Copying 4gb bytes ...
> 
>       12.272017 GB/sec
> 
>   $ perf bench mem map -s 4gb -p 1gb -l 10 -f populate
>   # Running 'mem/map' benchmark:
>   # function 'populate' (Eagerly populated map)
>   # Copying 4gb bytes ...
> 
>       17.085927 GB/sec
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  tools/perf/bench/bench.h         |  1 +
>  tools/perf/bench/mem-functions.c | 96 ++++++++++++++++++++++++++++++++
>  tools/perf/builtin-bench.c       |  1 +
>  3 files changed, 98 insertions(+)
> 
> diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
> index 9f736423af53..46484bb0eefb 100644
> --- a/tools/perf/bench/bench.h
> +++ b/tools/perf/bench/bench.h
> @@ -28,6 +28,7 @@ int bench_syscall_fork(int argc, const char **argv);
>  int bench_syscall_execve(int argc, const char **argv);
>  int bench_mem_memcpy(int argc, const char **argv);
>  int bench_mem_memset(int argc, const char **argv);
> +int bench_mem_map(int argc, const char **argv);
>  int bench_mem_find_bit(int argc, const char **argv);
>  int bench_futex_hash(int argc, const char **argv);
>  int bench_futex_wake(int argc, const char **argv);
> diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
> index 8a37da149327..ea62e3583a70 100644
> --- a/tools/perf/bench/mem-functions.c
> +++ b/tools/perf/bench/mem-functions.c
> @@ -40,6 +40,7 @@ static const char	*chunk_size_str	= "0";
>  static unsigned int	nr_loops	= 1;
>  static bool		use_cycles;
>  static int		cycles_fd;
> +static unsigned int	seed;
>  
>  static const struct option bench_common_options[] = {
>  	OPT_STRING('s', "size", &size_str, "1MB",
> @@ -81,6 +82,7 @@ struct bench_params {
>  	size_t		chunk_size;
>  	unsigned int	nr_loops;
>  	unsigned int	page_shift;
> +	unsigned int	seed;
>  };
>  
>  struct bench_mem_info {
> @@ -98,6 +100,7 @@ typedef void (*mem_fini_t)(struct bench_mem_info *, struct bench_params *,
>  			   void **, void **);
>  typedef void *(*memcpy_t)(void *, const void *, size_t);
>  typedef void *(*memset_t)(void *, int, size_t);
> +typedef void (*map_op_t)(void *, size_t, unsigned int, bool);
>  
>  struct function {
>  	const char *name;
> @@ -108,6 +111,7 @@ struct function {
>  		union {
>  			memcpy_t memcpy;
>  			memset_t memset;
> +			map_op_t map_op;
>  		};
>  	} fn;
>  };
> @@ -160,6 +164,14 @@ static union bench_clock clock_diff(union bench_clock *s, union bench_clock *e)
>  	return t;
>  }
>  
> +static void clock_accum(union bench_clock *a, union bench_clock *b)
> +{
> +	if (use_cycles)
> +		a->cycles += b->cycles;
> +	else
> +		timeradd(&a->tv, &b->tv, &a->tv);
> +}
> +
>  static double timeval2double(struct timeval *ts)
>  {
>  	return (double)ts->tv_sec + (double)ts->tv_usec / (double)USEC_PER_SEC;
> @@ -270,6 +282,8 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
>  	}
>  	p.page_shift = ilog2(page_size);
>  
> +	p.seed = seed;
> +
>  	if (!strncmp(function_str, "all", 3)) {
>  		for (i = 0; info->functions[i].name; i++)
>  			__bench_mem_function(info, &p, i);
> @@ -464,3 +478,85 @@ int bench_mem_memset(int argc, const char **argv)
>  
>  	return bench_mem_common(argc, argv, &info);
>  }
> +
> +static void map_page_touch(void *dst, size_t size, unsigned int page_shift, bool random)
> +{
> +	unsigned long npages = size / (1 << page_shift);
> +	unsigned long offset = 0, r = 0;
> +
> +	for (unsigned long i = 0; i < npages; i++) {
> +		if (random)
> +			r = rand() % (1 << page_shift);
> +
> +		*((char *)dst + offset + r) = *(char *)(dst + offset + r) + i;
> +		offset += 1 << page_shift;
> +	}
> +}
> +
> +static int do_map(const struct function *r, struct bench_params *p,
> +		  void *src __maybe_unused, void *dst __maybe_unused,
> +		  union bench_clock *accum)
> +{
> +	union bench_clock start, end, diff;
> +	map_op_t fn = r->fn.map_op;
> +	bool populate = strcmp(r->name, "populate") == 0;
> +
> +	if (p->seed)
> +		srand(p->seed);
> +
> +	for (unsigned int i = 0; i < p->nr_loops; i++) {
> +		clock_get(&start);
> +		dst = bench_mmap(p->size, populate, p->page_shift);
> +		if (!dst)
> +			goto out;
> +
> +		fn(dst, p->size, p->page_shift, p->seed);
> +		clock_get(&end);
> +		diff = clock_diff(&start, &end);
> +		clock_accum(accum, &diff);
> +
> +		bench_munmap(dst, p->size);
> +	}
> +
> +	return 0;
> +out:
> +	printf("# Memory allocation failed - maybe size (%s) %s?\n", size_str,
> +			p->page_shift != PAGE_SHIFT_4KB ? "has insufficient hugepages" : "is too large");
> +	return -1;
> +}
> +
> +static const char * const bench_mem_map_usage[] = {
> +	"perf bench mem map <options>",
> +	NULL
> +};
> +
> +static const struct function map_functions[] = {
> +	{ .name		= "populate",
> +	  .desc		= "Eagerly populated map",
> +	  .fn.map_op	= map_page_touch },
> +
> +	{ .name		= "demand",
> +	  .desc		= "Demand loaded map",
> +	  .fn.map_op	= map_page_touch },
> +
> +	{ .name = NULL, }
> +};
> +
> +int bench_mem_map(int argc, const char **argv)
> +{
> +	static const struct option bench_map_options[] = {
> +		OPT_UINTEGER('r', "randomize", &seed,
> +			    "Seed to randomize page RW offset with."),
> +		OPT_PARENT(bench_common_options),
> +		OPT_END()
> +	};
> +
> +	struct bench_mem_info info = {
> +		.functions		= map_functions,
> +		.do_op			= do_map,
> +		.usage			= bench_mem_map_usage,
> +		.options		= bench_map_options,
> +	};
> +
> +	return bench_mem_common(argc, argv, &info);
> +}
> diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
> index 2c1a9f3d847a..a20bd9882f0a 100644
> --- a/tools/perf/builtin-bench.c
> +++ b/tools/perf/builtin-bench.c
> @@ -65,6 +65,7 @@ static struct bench mem_benchmarks[] = {
>  	{ "memcpy",	"Benchmark for memcpy() functions",		bench_mem_memcpy	},
>  	{ "memset",	"Benchmark for memset() functions",		bench_mem_memset	},
>  	{ "find_bit",	"Benchmark for find_bit() functions",		bench_mem_find_bit	},
> +	{ "map",	"Benchmark for mmap() mappings",		bench_mem_map		},
>  	{ "all",	"Run all memory access benchmarks",		NULL			},
>  	{ NULL,		NULL,						NULL			}
>  };
> -- 
> 2.43.5
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 12/14] mm: add config option for clearing page-extents
  2025-07-14 20:35         ` Ankur Arora
@ 2025-07-15 20:59           ` David Hildenbrand
  0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2025-07-15 20:59 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, ziy

On 14.07.25 22:35, Ankur Arora wrote:
> 
> [ Added Zi Yan. ]
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 11.07.25 19:32, Ankur Arora wrote:
>>> David Hildenbrand <david@redhat.com> writes:
>>>
>>>> On 10.07.25 02:59, Ankur Arora wrote:
>>>>> Add CONFIG_CLEAR_PAGE_EXTENT to allow clearing of page-extents
>>>>> where architecturally supported.
>>>>> This is only available with !CONFIG_HIGHMEM because the intent is to
>>>>> use architecture support to clear contiguous extents in a single
>>>>> operation (ex. via FEAT_MOPS on arm64, string instructions on x86)
>>>>> which excludes any possibility of interspersing kmap()/kunmap().
>>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>>> ---
>>>>
>>>> Staring at the next patch, I think this can easily be squashed into the next
>>>> patch where you add actual MM core support.
>>> I wanted to do this in a separate patch to explicitly document what the
>>> responsibility of the interface provided by the architecture is.
>>> That said, the commit message didn't actually do a good job of doing
>>> that :D.
>>> Copying the more detailed commit message from my reply to Andrew,
>>> one important part of the clear_pages() contract is that it be interruptible
>>> because clear_pages_resched() implicitly depends on it.
>>>
>>>> This is only enabled with !CONFIG_HIGHMEM because the intent is
>>>> to use architecture support to clear contiguous extents in a
>>>> single interruptible operation (ex. via FEAT_MOPS on arm64,
>>>> string instructions on x86).
>>>
>>>> Given that we might be zeroing the whole extent with a single
>>>> instruction, this excludes any possibility of constructing
>>>> intermediate HIGHMEM maps.
>>> Do you think it is best documented in the next patch in a comment
>>> instead?
>>
>> I would just add + document it as part of the next patch.
>>
>> Looking at the bigger picture now, you introduce
>>
>> 	ARCH_HAS_CLEAR_PAGES
>>
>> To say whether an architecture provides clear_pages().
>>
>> Now we want to conditionally use that to optimize folio_zero_user().
>>
>> Remind me, why do we want to glue this to THP / HUGETLBFS only? I would assume
>> that the code footprint is rather small, and the systems out there that are
>> compiled with ARCH_HAS_CLEAR_PAGES but without THP / HUGETLBFS are rather ...
>> rare (mostly 32BIT x86 only).

Agreed.

> 
> I thought about this some more and there are a few other interfaces that
> end up clearing pages:
> 
>> clear_highpage()
>> clear_highpage_kasan_tagged()
>> tag_clear_highpage()
> 
> In this set, there are many loops of the form:
> 
>     for (i = 0; i < n; i++)
>         clear_highpage(page + i);
> 
> At least some of these (including kernel_init_pages()) could be migrated
> to variations on a clear_highpages() which could be:
> 
>      static inline void clear_highpages(struct page *page, u32 num_pages)
>      {
>              u32 i;
>
>              if (!IS_ENABLED(CONFIG_HIGHMEM))
>                      clear_pages_resched(page_address(page), num_pages);
>              else
>                      for (i = 0; i < num_pages; ++i)
>                              clear_highpage(page + i);
>      }
> 
> (clear_pages_resched() should be safe to be used from here because
> everybody using this should be in a schedulable context.)
> 
> (The kernel_init_pages() was also suggested by Zi Yan in a review of v3 [1].)
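> 
> As an illustration (a hypothetical sketch: clear_highpages_kasan_tagged()
> below is an assumed batched variant along the same lines), the loop in
> kernel_init_pages() might then reduce to:
> 
>      static void kernel_init_pages(struct page *page, int numpages)
>      {
>              kasan_disable_current();
>              /* assumed batched variant of clear_highpage_kasan_tagged() */
>              clear_highpages_kasan_tagged(page, numpages);
>              kasan_enable_current();
>      }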
> 
>> clear_user_highpage()
> 
> Only users folio_zero_user(), __collapse_huge_page_copy() and
> userfaultd.
> 
>> clear_user_page()
> Not many users apart from the highmem interface.
> 
>> clear_page()
> 
> Not many users apart from the highmem interface.
> 
> I'm happy to do this work, just not sure how to stage it. In particular I
> would like to avoid a series which tries to address all of the cases.
> 
> Maybe it makes sense to just add the clear_highpages() variants, the
> folio_zero_user() handling, and some of the obvious users of
> clear_highpages() for v6?

Yes, no need for excessive handling.

What I was getting at was: could we get rid of the kconfig option and 
simply glue it to the availability of clear_pages() in a reasonable way?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 13/14] mm: memory: support clearing page-extents
  2025-07-10  0:59 ` [PATCH v5 13/14] mm: memory: support " Ankur Arora
  2025-07-11 11:44   ` David Hildenbrand
@ 2025-07-15 22:08   ` David Hildenbrand
  2025-07-16  3:19     ` Ankur Arora
  1 sibling, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2025-07-15 22:08 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk

On 10.07.25 02:59, Ankur Arora wrote:
> folio_zero_user() is constrained to clear in a page-at-a-time
> fashion because it supports CONFIG_HIGHMEM which means that kernel
> mappings for pages in a folio are not guaranteed to be contiguous.
> 
> We don't have this problem when running under configurations with
> CONFIG_CLEAR_PAGE_EXTENT (implies !CONFIG_HIGHMEM), so zero in
> longer page-extents.
> This is expected to be faster because the processor can now optimize
> the clearing based on the knowledge of the extent.
> 
> However, clearing in larger chunks can have two other problems:
> 
>   - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>     (larger folios don't have any expectation of cache locality).
> 
>   - preemption latency when clearing large folios.
> 
> Handle the first by splitting the clearing in three parts: the
> faulting page and its immediate locality, its left and right
> regions; the local neighbourhood is cleared last.
> 
> The second problem is relevant only when running under cooperative
> preemption models. Limit the worst case preemption latency by clearing
> in architecture specified ARCH_CLEAR_PAGE_EXTENT units.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>   mm/memory.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 85 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index b0cda5aab398..c52806270375 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7034,6 +7034,7 @@ static inline int process_huge_page(
>   	return 0;
>   }
>   
> +#ifndef CONFIG_CLEAR_PAGE_EXTENT
>   static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>   				unsigned int nr_pages)
>   {
> @@ -7058,7 +7059,10 @@ static int clear_subpage(unsigned long addr, int idx, void *arg)
>   /**
>    * folio_zero_user - Zero a folio which will be mapped to userspace.
>    * @folio: The folio to zero.
> - * @addr_hint: The address will be accessed or the base address if uncelar.
> + * @addr_hint: The address accessed by the user or the base address.
> + *
> + * folio_zero_user() uses clear_gigantic_page() or process_huge_page() to
> + * do page-at-a-time zeroing because it needs to handle CONFIG_HIGHMEM.
>    */
>   void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>   {
> @@ -7070,6 +7074,86 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>   		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>   }
>   
> +#else /* CONFIG_CLEAR_PAGE_EXTENT */
> +
> +static void clear_pages_resched(void *addr, int npages)
> +{
> +	int i, remaining;
> +
> +	if (preempt_model_preemptible()) {
> +		clear_pages(addr, npages);
> +		goto out;
> +	}
> +
> +	for (i = 0; i < npages/ARCH_CLEAR_PAGE_EXTENT; i++) {
> +		clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
> +			    ARCH_CLEAR_PAGE_EXTENT);
> +		cond_resched();
> +	}
> +
> +	remaining = npages % ARCH_CLEAR_PAGE_EXTENT;
> +
> +	if (remaining)
> +		clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SHIFT,
> +			    remaining);
> +out:
> +	cond_resched();
> +}
> +
> +/*
> + * folio_zero_user - Zero a folio which will be mapped to userspace.
> + * @folio: The folio to zero.
> + * @addr_hint: The address accessed by the user or the base address.
> + *
> + * Uses architectural support for clear_pages() to zero page extents
> + * instead of clearing page-at-a-time.
> + *
> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
> + * pages in the immediate locality of the faulting page, and its left, right
> + * regions; the local neighbourhood cleared last in order to keep cache
> + * lines of the target region hot.
> + *
> + * For larger folios we assume that there is no expectation of cache locality
> + * and just do a straight zero.
> + */
> +void folio_zero_user(struct folio *folio, unsigned long addr_hint)
> +{
> +	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
> +	const int width = 2; /* number of pages cleared last on either side */
> +	struct range r[3];
> +	int i;
> +
> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
> +		clear_pages_resched(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
> +		return;
> +	}
> +
> +	/*
> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
> +	 * ensure it sticks around in the cache.
> +	 */
> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
> +
> +	/* Region to the left of the fault */
> +	r[1] = DEFINE_RANGE(pg.start,
> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
> +
> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
> +			    pg.end);
> +
> +	for (i = 0; i <= 2; i++) {
> +		int npages = range_len(&r[i]);
> +
> +		if (npages > 0)
> +			clear_pages_resched(page_address(folio_page(folio, r[i].start)), npages);
> +	}
> +}
> +#endif /* CONFIG_CLEAR_PAGE_EXTENT */
> +
>   static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>   				   unsigned long addr_hint,
>   				   struct vm_area_struct *vma,

So, folio_zero_user() is only compiled with THP | HUGETLB already.

What we should probably do is scrap the whole new kconfig option and
do something like this in here:

diff --git a/mm/memory.c b/mm/memory.c
index 3dd6c57e6511e..64b6bd3e7657a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7009,19 +7009,53 @@ static inline int process_huge_page(
  	return 0;
  }
  
-static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
-				unsigned int nr_pages)
+#ifdef CONFIG_ARCH_HAS_CLEAR_PAGES
+static void clear_user_highpages_resched(struct page *page,
+		unsigned int nr_pages, unsigned long addr)
+{
+	void *kaddr = page_address(page);
+	int i, remaining;
+
+	/*
+	 * CONFIG_ARCH_HAS_CLEAR_PAGES is not expected to be set on systems
+	 * with HIGHMEM, so we can safely use clear_pages().
+	 */
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHMEM));
+
+	if (preempt_model_preemptible()) {
+		clear_pages(kaddr, nr_pages);
+		goto out;
+	}
+
+	for (i = 0; i < nr_pages/ARCH_CLEAR_PAGE_EXTENT; i++) {
+		clear_pages(kaddr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
+			    ARCH_CLEAR_PAGE_EXTENT);
+		cond_resched();
+	}
+
+	remaining = nr_pages % ARCH_CLEAR_PAGE_EXTENT;
+
+	if (remaining)
+		clear_pages(kaddr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
+			    remaining);
+out:
+	cond_resched();
+}
+#else
+static void clear_user_highpages_resched(struct page *page,
+		unsigned int nr_pages, unsigned long addr)
  {
-	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
  	int i;
  
  	might_sleep();
  	for (i = 0; i < nr_pages; i++) {
  		cond_resched();
-		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
+		clear_user_highpage(nth_page(page, i), addr + i * PAGE_SIZE);
  	}
  }
  
+#endif /* CONFIG_ARCH_HAS_CLEAR_PAGES */
+
  static int clear_subpage(unsigned long addr, int idx, void *arg)
  {
  	struct folio *folio = arg;
@@ -7030,19 +7064,76 @@ static int clear_subpage(unsigned long addr, int idx, void *arg)
  	return 0;
  }
  
-/**
+static void folio_zero_user_huge(struct folio *folio, unsigned long addr_hint)
+{
+	const unsigned int nr_pages = folio_nr_pages(folio);
+	const unsigned long addr = ALIGN_DOWN(addr_hint, nr_pages * PAGE_SIZE);
+	const long fault_idx = (addr_hint - addr) / PAGE_SIZE;
+	const struct range pg = DEFINE_RANGE(0, nr_pages - 1);
+	const int width = 2; /* number of pages cleared last on either side */
+	struct range r[3];
+	int i;
+
+	/*
+	 * Without an optimized clear_user_highpages_resched(), we'll perform
+	 * some extra magic dance around the faulting address.
+	 */
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_CLEAR_PAGES)) {
+		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+		return;
+	}
+
+	/*
+	 * Faulting page and its immediate neighbourhood. Cleared at the end to
+	 * ensure it sticks around in the cache.
+	 */
+	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+	/* Region to the left of the fault */
+	r[1] = DEFINE_RANGE(pg.start,
+			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+			    pg.end);
+
+	for (i = 0; i <= 2; i++) {
+		unsigned int cur_nr_pages = range_len(&r[i]);
+		struct page *cur_page = folio_page(folio, r[i].start);
+		unsigned long cur_addr = addr + folio_page_idx(folio, cur_page) * PAGE_SIZE;
+
+		if (cur_nr_pages > 0)
+			clear_user_highpages_resched(cur_page, cur_nr_pages, cur_addr);
+	}
+}
+
+/*
   * folio_zero_user - Zero a folio which will be mapped to userspace.
   * @folio: The folio to zero.
- * @addr_hint: The address will be accessed or the base address if uncelar.
+ * @addr_hint: The address accessed by the user or the base address.
+ *
+ * Uses architectural support for clear_pages() to zero page extents
+ * instead of clearing page-at-a-time.
+ *
+ * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
+ * pages in the immediate locality of the faulting page, and its left, right
+ * regions; the local neighbourhood cleared last in order to keep cache
+ * lines of the target region hot.
+ *
+ * For larger folios we assume that there is no expectation of cache locality
+ * and just do a straight zero.
   */
  void folio_zero_user(struct folio *folio, unsigned long addr_hint)
  {
-	unsigned int nr_pages = folio_nr_pages(folio);
+	const unsigned int nr_pages = folio_nr_pages(folio);
+	const unsigned long addr = ALIGN_DOWN(addr_hint, nr_pages * PAGE_SIZE);
  
-	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
-		clear_gigantic_page(folio, addr_hint, nr_pages);
-	else
-		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+	if (unlikely(nr_pages >= MAX_ORDER_NR_PAGES)) {
+		clear_user_highpages_resched(folio_page(folio, 0), nr_pages, addr);
+		return;
+	}
+	folio_zero_user_huge(folio, addr_hint);
  }
  
  static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
-- 
2.50.1



Note that this is probably completely broken in various ways, just to give you
an idea.

*maybe* we could change clear_user_highpages_resched to something like
folio_zero_user_range(), consuming a folio + idx instead of a page. That might
or might not be better here.

-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 02/14] perf bench mem: Defer type munging of size to float
  2025-07-15 20:05   ` Namhyung Kim
@ 2025-07-16  2:17     ` Ankur Arora
  0 siblings, 0 replies; 52+ messages in thread
From: Ankur Arora @ 2025-07-16  2:17 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, david, bp,
	dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk


Namhyung Kim <namhyung@kernel.org> writes:

> On Wed, Jul 09, 2025 at 05:59:14PM -0700, Ankur Arora wrote:
>> Do type conversion to double at the point of use.
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>
> Reviewed-by: Namhyung Kim <namhyung@kernel.org>
>
> A nitpick below.
>
>> ---
>>  tools/perf/bench/mem-functions.c | 14 +++++++-------
>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
>> index 8599ed96ee1f..b8f020379197 100644
>> --- a/tools/perf/bench/mem-functions.c
>> +++ b/tools/perf/bench/mem-functions.c
>> @@ -139,7 +139,7 @@ struct bench_mem_info {
>>  	bool alloc_src;
>>  };
>>
>> -static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t size, double size_total)
>> +static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t size, size_t size_total)
>>  {
>>  	const struct function *r = &info->functions[r_idx];
>>  	double result_bps = 0.0;
>> @@ -165,18 +165,18 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
>>  	switch (bench_format) {
>>  	case BENCH_FORMAT_DEFAULT:
>>  		if (use_cycles) {
>> -			printf(" %14lf cycles/byte\n", (double)rt.cycles/size_total);
>> +			printf(" %14lf cycles/byte\n", (double)rt.cycles/(double)size_total);
>>  		} else {
>> -			result_bps = size_total/timeval2double(&rt.tv);
>> +			result_bps = (double)size_total/timeval2double(&rt.tv);
>>  			print_bps(result_bps);
>>  		}
>>  		break;
>>
>>  	case BENCH_FORMAT_SIMPLE:
>>  		if (use_cycles) {
>> -			printf("%lf\n", (double)rt.cycles/size_total);
>> +			printf("%lf\n", (double)rt.cycles/(double)size_total);
>>  		} else {
>> -			result_bps = size_total/timeval2double(&rt.tv);
>> +			result_bps = (double)size_total/timeval2double(&rt.tv);
>>  			printf("%lf\n", result_bps);
>>  		}
>>  		break;
>> @@ -199,7 +199,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
>>  {
>>  	int i;
>>  	size_t size;
>> -	double size_total;
>> +	size_t size_total;
>>
>>  	argc = parse_options(argc, argv, options, info->usage, 0);
>>
>> @@ -212,7 +212,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
>>  	}
>>
>>  	size = (size_t)perf_atoll((char *)size_str);
>> -	size_total = (double)size * nr_loops;
>> +	size_total = (size_t)size * nr_loops;
>
> No need to cast.

Ah, yes.

--
ankur

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 06/14] perf bench mem: Allow mapping of hugepages
  2025-07-15 20:12   ` Namhyung Kim
@ 2025-07-16  2:32     ` Ankur Arora
  0 siblings, 0 replies; 52+ messages in thread
From: Ankur Arora @ 2025-07-16  2:32 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, david, bp,
	dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk


Namhyung Kim <namhyung@kernel.org> writes:

> On Wed, Jul 09, 2025 at 05:59:18PM -0700, Ankur Arora wrote:
>> Page sizes that can be selected: 4KB, 2MB, 1GB.
>>
>> Both the reservation and the node from which hugepages are allocated
>> are expected to be addressed by the user.
>
> It'd be nice if you could provide an example in the commit message when
> you introduce a new command line option.  Also please update
> Documentation/perf-bench.txt.

Will do.

Thanks for all the reviews!

--
ankur

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 07/14] perf bench mem: Allow chunking on a memory region
  2025-07-15 20:17   ` Namhyung Kim
@ 2025-07-16  2:34     ` Ankur Arora
  0 siblings, 0 replies; 52+ messages in thread
From: Ankur Arora @ 2025-07-16  2:34 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, david, bp,
	dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk


Namhyung Kim <namhyung@kernel.org> writes:

> On Wed, Jul 09, 2025 at 05:59:19PM -0700, Ankur Arora wrote:
>> There can be a significant gap in memset/memcpy performance depending
>> on the size of the region being operated on.
>>
>> With chunk-size=4kb:
>>
>>   $ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
>>
>>   $ perf bench mem memset -p 4kb -k 4kb -s 4gb -l 10 -f x86-64-stosq
>>   # Running 'mem/memset' benchmark:
>>   # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
>>   # Copying 4gb bytes ...
>>
>>       13.011655 GB/sec
>>
>> With chunk-size=1gb:
>>
>>   $ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
>>
>>   $ perf bench mem memset -p 4kb -k 1gb -s 4gb -l 10 -f x86-64-stosq
>>   # Running 'mem/memset' benchmark:
>>   # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
>>   # Copying 4gb bytes ...
>>
>>       21.936355 GB/sec
>>
>> So, allow the user to specify the chunk-size.
>>
>> The default value is identical to the total size of the region, which
>> preserves current behaviour.
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>
> Again, please update the documentation.  With that,
>
> Reviewed-by: Namhyung Kim <namhyung@kernel.org>
>
Thanks! Will do.

--
ankur

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 09/14] perf bench mem: Add mmap() workloads
  2025-07-15 20:20   ` Namhyung Kim
@ 2025-07-16  2:40     ` Ankur Arora
  0 siblings, 0 replies; 52+ messages in thread
From: Ankur Arora @ 2025-07-16  2:40 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, david, bp,
	dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk


Namhyung Kim <namhyung@kernel.org> writes:

> On Wed, Jul 09, 2025 at 05:59:21PM -0700, Ankur Arora wrote:
>> Add two mmap() workloads: one that eagerly populates a region and
>> another that demand faults it in.
>>
>> The intent is to probe the memory subsystem performance incurred
>> by mmap().
>
> Maybe better to name 'mmap' as other tests named after the actual
> function.

That's true. My reason for not doing it was that these tests were
oriented towards testing the memory subsystem instead of the system
call itself.

But I guess "map" is not a good name either. Any suggestions?

> Also please update the documentation.

Will do.

Thanks
Ankur

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 13/14] mm: memory: support clearing page-extents
  2025-07-15 22:08   ` David Hildenbrand
@ 2025-07-16  3:19     ` Ankur Arora
  2025-07-16  8:03       ` David Hildenbrand
  0 siblings, 1 reply; 52+ messages in thread
From: Ankur Arora @ 2025-07-16  3:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk


David Hildenbrand <david@redhat.com> writes:

> On 10.07.25 02:59, Ankur Arora wrote:
>> folio_zero_user() is constrained to clear in a page-at-a-time
>> fashion because it supports CONFIG_HIGHMEM which means that kernel
>> mappings for pages in a folio are not guaranteed to be contiguous.
>> We don't have this problem when running under configurations with
>> CONFIG_CLEAR_PAGE_EXTENT (implies !CONFIG_HIGHMEM), so zero in
>> longer page-extents.
>> This is expected to be faster because the processor can now optimize
>> the clearing based on the knowledge of the extent.
>> However, clearing in larger chunks can have two other problems:
>>   - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>     (larger folios don't have any expectation of cache locality).
>>   - preemption latency when clearing large folios.
>> Handle the first by splitting the clearing in three parts: the
>> faulting page and its immediate locality, its left and right
>> regions; the local neighbourhood is cleared last.
>> The second problem is relevant only when running under cooperative
>> preemption models. Limit the worst case preemption latency by clearing
>> in architecture specified ARCH_CLEAR_PAGE_EXTENT units.
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>   mm/memory.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 85 insertions(+), 1 deletion(-)
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b0cda5aab398..c52806270375 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -7034,6 +7034,7 @@ static inline int process_huge_page(
>>   	return 0;
>>   }
>>   +#ifndef CONFIG_CLEAR_PAGE_EXTENT
>>   static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>>   				unsigned int nr_pages)
>>   {
>> @@ -7058,7 +7059,10 @@ static int clear_subpage(unsigned long addr, int idx, void *arg)
>>   /**
>>    * folio_zero_user - Zero a folio which will be mapped to userspace.
>>    * @folio: The folio to zero.
>> - * @addr_hint: The address will be accessed or the base address if uncelar.
>> + * @addr_hint: The address accessed by the user or the base address.
>> + *
>> + * folio_zero_user() uses clear_gigantic_page() or process_huge_page() to
>> + * do page-at-a-time zeroing because it needs to handle CONFIG_HIGHMEM.
>>    */
>>   void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>   {
>> @@ -7070,6 +7074,86 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>   		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>   }
>>   +#else /* CONFIG_CLEAR_PAGE_EXTENT */
>> +
>> +static void clear_pages_resched(void *addr, int npages)
>> +{
>> +	int i, remaining;
>> +
>> +	if (preempt_model_preemptible()) {
>> +		clear_pages(addr, npages);
>> +		goto out;
>> +	}
>> +
>> +	for (i = 0; i < npages/ARCH_CLEAR_PAGE_EXTENT; i++) {
>> +		clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
>> +			    ARCH_CLEAR_PAGE_EXTENT);
>> +		cond_resched();
>> +	}
>> +
>> +	remaining = npages % ARCH_CLEAR_PAGE_EXTENT;
>> +
>> +	if (remaining)
>> +		clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
>> +			    remaining);
>> +out:
>> +	cond_resched();
>> +}
>> +
>> +/*
>> + * folio_zero_user - Zero a folio which will be mapped to userspace.
>> + * @folio: The folio to zero.
>> + * @addr_hint: The address accessed by the user or the base address.
>> + *
>> + * Uses architectural support for clear_pages() to zero page extents
>> + * instead of clearing page-at-a-time.
>> + *
>> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
>> + * pages in the immediate locality of the faulting page, and its left, right
>> + * regions; the local neighbourhood cleared last in order to keep cache
>> + * lines of the target region hot.
>> + *
>> + * For larger folios we assume that there is no expectation of cache locality
>> + * and just do a straight zero.
>> + */
>> +void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>> +{
>> +	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>> +	const int width = 2; /* number of pages cleared last on either side */
>> +	struct range r[3];
>> +	int i;
>> +
>> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>> +		clear_pages_resched(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
>> +	 * ensure it sticks around in the cache.
>> +	 */
>> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
>> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
>> +
>> +	/* Region to the left of the fault */
>> +	r[1] = DEFINE_RANGE(pg.start,
>> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
>> +
>> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
>> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
>> +			    pg.end);
>> +
>> +	for (i = 0; i <= 2; i++) {
>> +		int npages = range_len(&r[i]);
>> +
>> +		if (npages > 0)
>> +			clear_pages_resched(page_address(folio_page(folio, r[i].start)), npages);
>> +	}
>> +}
>> +#endif /* CONFIG_CLEAR_PAGE_EXTENT */
>> +
>>   static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>>   				   unsigned long addr_hint,
>>   				   struct vm_area_struct *vma,
>
> So, folio_zero_user() is only compiled with THP | HUGETLB already.
>
> What we should probably do is scrap the whole new kconfig option and
> do something like this in here:

So, in principle I don't disagree and unifying both of these is cleaner
than introducing a whole new option.

However that still leaves this code having to contort around CONFIG_HIGHMEM
which is probably even less frequently used than THP | HUGETLB.

Maybe we should get rid of ARCH_HAS_CLEAR_PAGES completely and have
everyone with !HIGHMEM either use a generic version of clear_pages()
which loops and calls clear_page(), or some architectural override.
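
A generic fallback could be as simple as (a sketch, assuming !HIGHMEM so
that page_address() always returns a directly mapped, contiguous range):

     #ifndef clear_pages
     static inline void clear_pages(void *addr, unsigned int npages)
     {
             unsigned int i;

             /* Page-at-a-time fallback; arches override with one operation. */
             for (i = 0; i < npages; i++)
                     clear_page(addr + i * PAGE_SIZE);
     }
     #endif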

And, then we can do a similar transformation with copy_pages() (and
copy_user_large_folio()).

At that point, process_huge_page() is used only for !HIGHMEM configs
which likely have relatively small caches, and so that leaves it
probably over-engineered.

The thing that gives me pause is that non-x86 might perform worse
when they switch away from the left-right zeroing approach in
process_huge_page() to a generic clear_pages().

So, maybe allowing architectures to opt in by having to define
ARCH_HAS_CLEAR_PAGES would allow doing this in a more measured fashion.

(There isn't much justification for the left-right zeroing in the
commit: c79b57e462b "mm: hugetlb: clear target sub-page last when
clearing huge page". Just that it performs better on a benchmark:
anon-w-seq.)


Thanks!

--
ankur

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 13/14] mm: memory: support clearing page-extents
  2025-07-16  3:19     ` Ankur Arora
@ 2025-07-16  8:03       ` David Hildenbrand
  2025-07-16 17:54         ` Ankur Arora
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2025-07-16  8:03 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk

On 16.07.25 05:19, Ankur Arora wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 10.07.25 02:59, Ankur Arora wrote:
>>> folio_zero_user() is constrained to clear in a page-at-a-time
>>> fashion because it supports CONFIG_HIGHMEM which means that kernel
>>> mappings for pages in a folio are not guaranteed to be contiguous.
>>> We don't have this problem when running under configurations with
>>> CONFIG_CLEAR_PAGE_EXTENT (implies !CONFIG_HIGHMEM), so zero in
>>> longer page-extents.
>>> This is expected to be faster because the processor can now optimize
>>> the clearing based on the knowledge of the extent.
>>> However, clearing in larger chunks can have two other problems:
>>>    - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>>      (larger folios don't have any expectation of cache locality).
>>>    - preemption latency when clearing large folios.
>>> Handle the first by splitting the clearing in three parts: the
>>> faulting page and its immediate locality, its left and right
>>> regions; the local neighbourhood is cleared last.
>>> The second problem is relevant only when running under cooperative
>>> preemption models. Limit the worst case preemption latency by clearing
>>> in architecture specified ARCH_CLEAR_PAGE_EXTENT units.
>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>> ---
>>>    mm/memory.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>>    1 file changed, 85 insertions(+), 1 deletion(-)
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index b0cda5aab398..c52806270375 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -7034,6 +7034,7 @@ static inline int process_huge_page(
>>>    	return 0;
>>>    }
>>>    +#ifndef CONFIG_CLEAR_PAGE_EXTENT
>>>    static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>>>    				unsigned int nr_pages)
>>>    {
>>> @@ -7058,7 +7059,10 @@ static int clear_subpage(unsigned long addr, int idx, void *arg)
>>>    /**
>>>     * folio_zero_user - Zero a folio which will be mapped to userspace.
>>>     * @folio: The folio to zero.
>>> - * @addr_hint: The address will be accessed or the base address if uncelar.
>>> + * @addr_hint: The address accessed by the user or the base address.
>>> + *
>>> + * folio_zero_user() uses clear_gigantic_page() or process_huge_page() to
>>> + * do page-at-a-time zeroing because it needs to handle CONFIG_HIGHMEM.
>>>     */
>>>    void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>>    {
>>> @@ -7070,6 +7074,86 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>>    		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>>    }
>>>    +#else /* CONFIG_CLEAR_PAGE_EXTENT */
>>> +
>>> +static void clear_pages_resched(void *addr, int npages)
>>> +{
>>> +	int i, remaining;
>>> +
>>> +	if (preempt_model_preemptible()) {
>>> +		clear_pages(addr, npages);
>>> +		goto out;
>>> +	}
>>> +
>>> +	for (i = 0; i < npages/ARCH_CLEAR_PAGE_EXTENT; i++) {
>>> +		clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
>>> +			    ARCH_CLEAR_PAGE_EXTENT);
>>> +		cond_resched();
>>> +	}
>>> +
>>> +	remaining = npages % ARCH_CLEAR_PAGE_EXTENT;
>>> +
>>> +	if (remaining)
>>> +		clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
>>> +			    remaining);
>>> +out:
>>> +	cond_resched();
>>> +}
>>> +
>>> +/*
>>> + * folio_zero_user - Zero a folio which will be mapped to userspace.
>>> + * @folio: The folio to zero.
>>> + * @addr_hint: The address accessed by the user or the base address.
>>> + *
>>> + * Uses architectural support for clear_pages() to zero page extents
>>> + * instead of clearing page-at-a-time.
>>> + *
>>> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
>>> + * pages in the immediate locality of the faulting page, and its left, right
>>> + * regions; the local neighbourhood cleared last in order to keep cache
>>> + * lines of the target region hot.
>>> + *
>>> + * For larger folios we assume that there is no expectation of cache locality
>>> + * and just do a straight zero.
>>> + */
>>> +void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>> +{
>>> +	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>>> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>>> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>>> +	const int width = 2; /* number of pages cleared last on either side */
>>> +	struct range r[3];
>>> +	int i;
>>> +
>>> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>>> +		clear_pages_resched(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
>>> +		return;
>>> +	}
>>> +
>>> +	/*
>>> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
>>> +	 * ensure it sticks around in the cache.
>>> +	 */
>>> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
>>> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
>>> +
>>> +	/* Region to the left of the fault */
>>> +	r[1] = DEFINE_RANGE(pg.start,
>>> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
>>> +
>>> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
>>> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
>>> +			    pg.end);
>>> +
>>> +	for (i = 0; i <= 2; i++) {
>>> +		int npages = range_len(&r[i]);
>>> +
>>> +		if (npages > 0)
>>> +			clear_pages_resched(page_address(folio_page(folio, r[i].start)), npages);
>>> +	}
>>> +}
>>> +#endif /* CONFIG_CLEAR_PAGE_EXTENT */
>>> +
>>>    static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>>>    				   unsigned long addr_hint,
>>>    				   struct vm_area_struct *vma,
>>
>> So, folio_zero_user() is only compiled with THP | HUGETLB already.
>>
>> What we should probably do is scrap the whole new kconfig option and
>> do something like this in here:
> 
> So, in principle I don't disagree and unifying both of these is cleaner
> than introducing a whole new option.

Yes, after playing with the code, a new config option just for that is not
what we want.

> 
> However that still leaves this code having to contort around CONFIG_HIGHMEM
> which is probably even less frequently used than THP | HUGETLB.

Not sure I understand your question correctly, but thp+hugetlb are compatible with
32bit and highmem.

There are plans of removing highmem support, but that's a different story :)

I think as long as these configs exist, we should just support them, although
performance is a secondary concern.

> 
> Maybe we should get rid of ARCH_HAS_CLEAR_PAGES completely and have
> everyone with !HIGHMEM either use a generic version of clear_pages()
> which loops and calls clear_page(), or some architectural override.
> 
> And, then we can do a similar transformation with copy_pages() (and
> copy_user_large_folio()).
> 
> At that point, process_huge_page() is used only for !HIGHMEM configs

I assume you meant HIGHMEM

> which likely have relatively small caches and so that leaves
> it probably over-engineered.

I don't think we need to jump through hoops to optimize performance on
highmem, yes.

> 
> The thing that gives me pause is that non-x86 might perform worse
> when they switch away from the left-right zeroing approach in
> process_huge_page() to a generic clear_pages().

Right. Or they perform better. Hard to know.

> 
> So, maybe allowing architectures to opt in by having to define
> ARCH_HAS_CLEAR_PAGES would allow doing this in a more measured fashion.

One tricky thing is dealing with architectures where clear_user_highpage()
does cache management.

So the more I think about it, I wonder if we really should just design it
all around clear_user_highpages and clear_user_pages, and have only a
single clearing algorithm.

Essentially, something like the following, just that we need a generic
clear_user_pages that iterates over clear_user_page.

Then, x86_64 could simply implement clear_user_pages by routing it to your
clear_pages, and define CLEAR_PAGES_RESCHED_NR (although I wonder if we can
do better here).



diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 6234f316468c9..031e19c56765b 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -264,6 +264,14 @@ static inline void tag_clear_highpage(struct page *page)
  #ifdef CONFIG_HIGHMEM
  void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
  		unsigned start2, unsigned end2);
+static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
+		unsigned int nr_pages)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr_pages; i++)
+		clear_user_highpage(nth_page(page, i), vaddr + i * PAGE_SIZE);
+}
  #else
  static inline void zero_user_segments(struct page *page,
  		unsigned start1, unsigned end1,
@@ -284,6 +292,7 @@ static inline void zero_user_segments(struct page *page,
  	for (i = 0; i < compound_nr(page); i++)
  		flush_dcache_page(page + i);
  }
+#define clear_user_highpages clear_user_pages
  #endif
  
  static inline void zero_user_segment(struct page *page,
diff --git a/mm/memory.c b/mm/memory.c
index 3dd6c57e6511e..8aebf6e0765d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7009,40 +7009,92 @@ static inline int process_huge_page(
  	return 0;
  }
  
-static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
-				unsigned int nr_pages)
+#ifndef CLEAR_PAGES_RESCHED_NR
+#define CLEAR_PAGES_RESCHED_NR		1
+#endif /* CLEAR_PAGES_RESCHED_NR */
+
+static void clear_user_highpages_resched(struct page *page, unsigned long addr,
+		unsigned int nr_pages)
  {
-	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
-	int i;
+	unsigned int i, remaining;
  
-	might_sleep();
-	for (i = 0; i < nr_pages; i++) {
+	if (preempt_model_preemptible()) {
+		clear_user_highpages(page, addr, nr_pages);
+		goto out;
+	}
+
+	for (i = 0; i < nr_pages / CLEAR_PAGES_RESCHED_NR; i++) {
+		clear_user_highpages(nth_page(page, i * CLEAR_PAGES_RESCHED_NR),
+				     addr + i * CLEAR_PAGES_RESCHED_NR * PAGE_SIZE,
+				     CLEAR_PAGES_RESCHED_NR);
-		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
  		cond_resched();
  	}
-}
  
-static int clear_subpage(unsigned long addr, int idx, void *arg)
-{
-	struct folio *folio = arg;
+	remaining = nr_pages % CLEAR_PAGES_RESCHED_NR;
  
-	clear_user_highpage(folio_page(folio, idx), addr);
-	return 0;
+	if (remaining)
+		clear_user_highpages(nth_page(page, i * CLEAR_PAGES_RESCHED_NR),
+				     addr + i * CLEAR_PAGES_RESCHED_NR * PAGE_SIZE,
+				     remaining);
+out:
+	cond_resched();
  }
  
-/**
+/*
   * folio_zero_user - Zero a folio which will be mapped to userspace.
   * @folio: The folio to zero.
- * @addr_hint: The address will be accessed or the base address if uncelar.
+ * @addr_hint: The address accessed by the user or the base address.
+ *
+ * Uses architectural support for clear_pages() to zero page extents
+ * instead of clearing page-at-a-time.
+ *
+ * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
+ * pages in the immediate locality of the faulting page, and its left, right
+ * regions; the local neighbourhood cleared last in order to keep cache
+ * lines of the target region hot.
+ *
+ * For larger folios we assume that there is no expectation of cache locality
+ * and just do a straight zero.
   */
  void folio_zero_user(struct folio *folio, unsigned long addr_hint)
  {
-	unsigned int nr_pages = folio_nr_pages(folio);
+	const unsigned int nr_pages = folio_nr_pages(folio);
+	const unsigned long addr = ALIGN_DOWN(addr_hint, nr_pages * PAGE_SIZE);
+	const long fault_idx = (addr_hint - addr) / PAGE_SIZE;
+	const struct range pg = DEFINE_RANGE(0, nr_pages - 1);
+	const int width = 2; /* number of pages cleared last on either side */
+	struct range r[3];
+	int i;
+
+	if (unlikely(nr_pages >= MAX_ORDER_NR_PAGES)) {
+		clear_user_highpages_resched(folio_page(folio, 0), addr, nr_pages);
+		return;
+	}
+
+	/*
+	 * Faulting page and its immediate neighbourhood. Cleared at the end to
+	 * ensure it sticks around in the cache.
+	 */
+	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+	/* Region to the left of the fault */
+	r[1] = DEFINE_RANGE(pg.start,
+			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+			    pg.end);
+
+	for (i = 0; i <= 2; i++) {
+		unsigned int cur_nr_pages = range_len(&r[i]);
+		struct page *cur_page = folio_page(folio, r[i].start);
+		unsigned long cur_addr = addr + folio_page_idx(folio, cur_page) * PAGE_SIZE;
+
+		if (cur_nr_pages > 0)
+			clear_user_highpages_resched(cur_page, cur_addr, cur_nr_pages);
+	}
  
-	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
-		clear_gigantic_page(folio, addr_hint, nr_pages);
-	else
-		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
  }
  
  static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
-- 
2.50.1


On highmem we'd simply process individual pages, who cares.

On !highmem, we'd use the optimized clear_user_pages -> clear_pages implementation
if available. Otherwise, we clear individual pages.
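
For the generic clear_user_pages(), something along these lines might do
(a sketch; clear_user_page() takes the kernel address, the user vaddr and
the page so that architectures can do their cache management):

     static inline void clear_user_pages(struct page *page, unsigned long vaddr,
                                         unsigned int nr_pages)
     {
             unsigned int i;

             for (i = 0; i < nr_pages; i++)
                     clear_user_page(page_address(nth_page(page, i)),
                                     vaddr + i * PAGE_SIZE, nth_page(page, i));
     }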

Yes, we'd lose the left-right pattern.

If really important, we could somehow let the architecture opt in and do the call
to the existing process function.


-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v5 13/14] mm: memory: support clearing page-extents
  2025-07-16  8:03       ` David Hildenbrand
@ 2025-07-16 17:54         ` Ankur Arora
  0 siblings, 0 replies; 52+ messages in thread
From: Ankur Arora @ 2025-07-16 17:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk


David Hildenbrand <david@redhat.com> writes:

> On 16.07.25 05:19, Ankur Arora wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 10.07.25 02:59, Ankur Arora wrote:
>>>> folio_zero_user() is constrained to clear in a page-at-a-time
>>>> fashion because it supports CONFIG_HIGHMEM which means that kernel
>>>> mappings for pages in a folio are not guaranteed to be contiguous.
>>>> We don't have this problem when running under configurations with
>>>> CONFIG_CLEAR_PAGE_EXTENT (implies !CONFIG_HIGHMEM), so zero in
>>>> longer page-extents.
>>>> This is expected to be faster because the processor can now optimize
>>>> the clearing based on the knowledge of the extent.
>>>> However, clearing in larger chunks can have two other problems:
>>>>    - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>>>      (larger folios don't have any expectation of cache locality).
>>>>    - preemption latency when clearing large folios.
>>>> Handle the first by splitting the clearing in three parts: the
>>>> faulting page and its immediate locality, its left and right
>>>> regions; the local neighbourhood is cleared last.
>>>> The second problem is relevant only when running under cooperative
>>>> preemption models. Limit the worst case preemption latency by clearing
>>>> in architecture specified ARCH_CLEAR_PAGE_EXTENT units.
>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>> ---
>>>>    mm/memory.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>>>    1 file changed, 85 insertions(+), 1 deletion(-)
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index b0cda5aab398..c52806270375 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -7034,6 +7034,7 @@ static inline int process_huge_page(
>>>>    	return 0;
>>>>    }
>>>>    +#ifndef CONFIG_CLEAR_PAGE_EXTENT
>>>>    static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>>>>    				unsigned int nr_pages)
>>>>    {
>>>> @@ -7058,7 +7059,10 @@ static int clear_subpage(unsigned long addr, int idx, void *arg)
>>>>    /**
>>>>     * folio_zero_user - Zero a folio which will be mapped to userspace.
>>>>     * @folio: The folio to zero.
>>>> - * @addr_hint: The address will be accessed or the base address if uncelar.
>>>> + * @addr_hint: The address accessed by the user or the base address.
>>>> + *
>>>> + * folio_zero_user() uses clear_gigantic_page() or process_huge_page() to
>>>> + * do page-at-a-time zeroing because it needs to handle CONFIG_HIGHMEM.
>>>>     */
>>>>    void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>>>    {
>>>> @@ -7070,6 +7074,86 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>>>    		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>>>>    }
>>>>    +#else /* CONFIG_CLEAR_PAGE_EXTENT */
>>>> +
>>>> +static void clear_pages_resched(void *addr, int npages)
>>>> +{
>>>> +	int i, remaining;
>>>> +
>>>> +	if (preempt_model_preemptible()) {
>>>> +		clear_pages(addr, npages);
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	for (i = 0; i < npages/ARCH_CLEAR_PAGE_EXTENT; i++) {
>>>> +		clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
>>>> +			    ARCH_CLEAR_PAGE_EXTENT);
>>>> +		cond_resched();
>>>> +	}
>>>> +
>>>> +	remaining = npages % ARCH_CLEAR_PAGE_EXTENT;
>>>> +
>>>> +	if (remaining)
>>>> +		clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
>>>> +			    remaining);
>>>> +out:
>>>> +	cond_resched();
>>>> +}
>>>> +
>>>> +/*
>>>> + * folio_zero_user - Zero a folio which will be mapped to userspace.
>>>> + * @folio: The folio to zero.
>>>> + * @addr_hint: The address accessed by the user or the base address.
>>>> + *
>>>> + * Uses architectural support for clear_pages() to zero page extents
>>>> + * instead of clearing page-at-a-time.
>>>> + *
>>>> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
>>>> + * pages in the immediate locality of the faulting page, and its left, right
>>>> + * regions; the local neighbourhood cleared last in order to keep cache
>>>> + * lines of the target region hot.
>>>> + *
>>>> + * For larger folios we assume that there is no expectation of cache locality
>>>> + * and just do a straight zero.
>>>> + */
>>>> +void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>>> +{
>>>> +	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>>>> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>>>> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>>>> +	const int width = 2; /* number of pages cleared last on either side */
>>>> +	struct range r[3];
>>>> +	int i;
>>>> +
>>>> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>>>> +		clear_pages_resched(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
>>>> +	 * ensure it sticks around in the cache.
>>>> +	 */
>>>> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
>>>> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
>>>> +
>>>> +	/* Region to the left of the fault */
>>>> +	r[1] = DEFINE_RANGE(pg.start,
>>>> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
>>>> +
>>>> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
>>>> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
>>>> +			    pg.end);
>>>> +
>>>> +	for (i = 0; i <= 2; i++) {
>>>> +		int npages = range_len(&r[i]);
>>>> +
>>>> +		if (npages > 0)
>>>> +			clear_pages_resched(page_address(folio_page(folio, r[i].start)), npages);
>>>> +	}
>>>> +}
>>>> +#endif /* CONFIG_CLEAR_PAGE_EXTENT */
>>>> +
>>>>    static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>>>>    				   unsigned long addr_hint,
>>>>    				   struct vm_area_struct *vma,
>>>
>>> So, folio_zero_user() is only compiled with THP | HUGETLB already.
>>>
>>> What we should probably do is scrap the whole new kconfig option and
>>> do something like this in here:
>> So, in principle I don't disagree and unifying both of these is cleaner
>> than introducing a whole new option.
>
> Yes, after playing with the code, a new config option just for that is not
> what we want.
>
>> However that still leaves this code having to contort around CONFIG_HIGHMEM
>> which is probably even less frequently used than THP | HUGETLB.
>
> Not sure I understand your question correctly, but thp+hugetlb are compatible with
> 32bit and highmem.
>
> There are plans of removing highmem support, but that's a different story :)

Oh that would be a godsend! Whenever that happens.

> I think as long as these configs exist, we should just support them, although
> performance is a secondary concern.
>
>> Maybe we should get rid of ARCH_HAS_CLEAR_PAGES completely and have
>> everyone with !HIGHMEM either use a generic version of clear_pages()
>> which loops and calls clear_page(), or some architectural override.
>> And, then we can do a similar transformation with copy_pages() (and
>> copy_user_large_folio()).
>> At that point, process_huge_page() is used only for !HIGHMEM configs
>
> I assume you meant HIGHMEM

Oh yeah.

>> which likely have relatively small caches and so that leaves
>> it probably over-engineered.
>
> I don't think we need to jump through hoops to optimize performance on
> highmem, yes.
>
>> The thing that gives me pause is that non-x86 might perform worse
>> when they switch away from the left-right zeroing approach in
>> process_huge_page() to a generic clear_pages().
>
> Right. Or they perform better. Hard to know.
>
>> So, maybe allowing architectures to opt in by having to define
>> ARCH_HAS_CLEAR_PAGES would allow doing this in a more measured fashion.
>
> One tricky thing is dealing with architectures where clear_user_highpage()
> does cache management.

Oh yeah, I was forgetting that.

> So the more I think about it, I wonder if we really should just design it
> all around clear_user_highpages and clear_user_pages, and have only a
> single clearing algorithm.

Great. This is exactly what I was hoping to eventually get to.

> Essentially, something like the following, just that we need a generic
> clear_user_pages that iterates over clear_user_page.
>
> Then, x86_64 could simply implement clear_user_pages by routing it to your
> clear_pages, and define CLEAR_PAGES_RESCHED_NR (although I wonder if we can
> do better here).

Agreed.

So, essentially just have the lower layer interfaces in place (generic
and arch specific where needed):
 clear_pages()
 clear_user_pages()
 clear_user_highpages()

With the arch defining whichever of those it needs (and ARCH_CLEAR_PAGES_RESCHED_NR).
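
For x86 that wiring could be as simple as (a hypothetical sketch, assuming
the clear_pages() added earlier in the series):

     /* arch/x86/include/asm/page.h */
     #define clear_user_pages clear_user_pages
     static inline void clear_user_pages(struct page *page, unsigned long vaddr,
                                         unsigned int nr_pages)
     {
             /* x86 has no D-cache aliasing, so vaddr is unused. */
             clear_pages(page_address(page), nr_pages);
     }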

And, a folio_zero_user() pretty much as below.

> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index 6234f316468c9..031e19c56765b 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -264,6 +264,14 @@ static inline void tag_clear_highpage(struct page *page)
>  #ifdef CONFIG_HIGHMEM
>  void zero_user_segments(struct page *page, unsigned start1, unsigned end1,
>  		unsigned start2, unsigned end2);
> +static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
> +		unsigned int nr_pages)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < nr_pages; i++)
> +		clear_user_highpage(nth_page(page, i), vaddr + i * PAGE_SIZE);
> +}
>  #else
>  static inline void zero_user_segments(struct page *page,
>  		unsigned start1, unsigned end1,
> @@ -284,6 +292,7 @@ static inline void zero_user_segments(struct page *page,
>  	for (i = 0; i < compound_nr(page); i++)
>  		flush_dcache_page(page + i);
>  }
> +#define clear_user_highpages clear_user_pages
>  #endif
>
>  static inline void zero_user_segment(struct page *page,
> diff --git a/mm/memory.c b/mm/memory.c
> index 3dd6c57e6511e..8aebf6e0765d8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7009,40 +7009,92 @@ static inline int process_huge_page(
>  	return 0;
>  }
>
> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
> -				unsigned int nr_pages)
> +#ifndef CLEAR_PAGES_RESCHED_NR
> +#define CLEAR_PAGES_RESCHED_NR		1
> +#endif /* CLEAR_PAGES_RESCHED_NR */
> +
> +static void clear_user_highpages_resched(struct page *page, unsigned long addr,
> +		unsigned int nr_pages)
>  {
> -	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> -	int i;
> +	unsigned int i, remaining;
>
> -	might_sleep();
> -	for (i = 0; i < nr_pages; i++) {
> +	if (preempt_model_preemptible()) {
> +		clear_user_highpages(page, addr, nr_pages);
> +		goto out;
> +	}
> +
> +	for (i = 0; i < nr_pages / CLEAR_PAGES_RESCHED_NR; i++) {
> +		clear_user_highpages(nth_page(page, i * CLEAR_PAGES_RESCHED_NR),
> +				     addr + i * CLEAR_PAGES_RESCHED_NR * PAGE_SIZE,
> +				     CLEAR_PAGES_RESCHED_NR);
> -		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
>  		cond_resched();
>  	}
> -}
>
> -static int clear_subpage(unsigned long addr, int idx, void *arg)
> -{
> -	struct folio *folio = arg;
> +	remaining = nr_pages % CLEAR_PAGES_RESCHED_NR;
>
> -	clear_user_highpage(folio_page(folio, idx), addr);
> -	return 0;
> +	if (remaining)
> +		clear_user_highpages(nth_page(page, i * CLEAR_PAGES_RESCHED_NR),
> +				     addr + i * CLEAR_PAGES_RESCHED_NR * PAGE_SIZE,
> +				     remaining);
> +out:
> +	cond_resched();
>  }
>
> -/**
> +/*
>   * folio_zero_user - Zero a folio which will be mapped to userspace.
>   * @folio: The folio to zero.
> - * @addr_hint: The address will be accessed or the base address if uncelar.
> + * @addr_hint: The address accessed by the user or the base address.
> + *
> + * Uses architectural support for clear_pages() to zero page extents
> + * instead of clearing page-at-a-time.
> + *
> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split into three parts:
> + * the pages in the immediate locality of the faulting page, and the regions
> + * to its left and right; the local neighbourhood is cleared last in order
> + * to keep the cache lines of the target region hot.
> + *
> + * For larger folios we assume that there is no expectation of cache locality
> + * and just do a straight zero.
>   */
>  void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>  {
> -	unsigned int nr_pages = folio_nr_pages(folio);
> +	const unsigned int nr_pages = folio_nr_pages(folio);
> +	const unsigned long addr = ALIGN_DOWN(addr_hint, nr_pages * PAGE_SIZE);
> +	const long fault_idx = (addr_hint - addr) / PAGE_SIZE;
> +	const struct range pg = DEFINE_RANGE(0, nr_pages - 1);
> +	const int width = 2; /* number of pages cleared last on either side */
> +	struct range r[3];
> +	int i;
> +
> +	if (unlikely(nr_pages >= MAX_ORDER_NR_PAGES)) {
> +		clear_user_highpages_resched(folio_page(folio, 0), addr, nr_pages);
> +		return;
> +	}
> +
> +	/*
> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
> +	 * ensure it sticks around in the cache.
> +	 */
> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
> +
> +	/* Region to the left of the fault */
> +	r[1] = DEFINE_RANGE(pg.start,
> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
> +
> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
> +			    pg.end);
> +
> +	for (i = 0; i <= 2; i++) {
> +		unsigned int cur_nr_pages = range_len(&r[i]);
> +		struct page *cur_page = folio_page(folio, r[i].start);
> +		unsigned long cur_addr = addr + folio_page_idx(folio, cur_page) * PAGE_SIZE;
> +
> +		if (cur_nr_pages > 0)
> +			clear_user_highpages_resched(cur_page, cur_addr, cur_nr_pages);
> +	}
>
> -	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
> -		clear_gigantic_page(folio, addr_hint, nr_pages);
> -	else
> -		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>  }
>
>  static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
> --
> 2.50.1
>
>
> On highmem we'd simply process individual pages, who cares.
>
> On !highmem, we'd use the optimized clear_user_pages -> clear_pages implementation
> if available. Otherwise, we clear individual pages.
>
> Yes, we'd lose the left-right pattern.
>
> If it's really important, we could somehow let the architecture opt in and
> do the call to the existing process function.

Great. Alright let me work on this.
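
(Sanity-checking my reading of the split logic above: with, say, nr_pages=512,
fault_idx=100, width=2, we'd get r[2]=[98,102] (local), r[1]=[0,97] (left) and
r[0]=[103,511] (right), cleared right, then left, then local; and for the
common fault_idx=0 case, r[1] collapses to the empty range [0,-1] and is
skipped by the cur_nr_pages check.)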

And, thanks for the very helpful comments.

--
ankur

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2025-07-16 17:55 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-10  0:59 [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents Ankur Arora
2025-07-10  0:59 ` [PATCH v5 01/14] perf bench mem: Remove repetition around time measurement Ankur Arora
2025-07-15 20:04   ` Namhyung Kim
2025-07-10  0:59 ` [PATCH v5 02/14] perf bench mem: Defer type munging of size to float Ankur Arora
2025-07-15 20:05   ` Namhyung Kim
2025-07-16  2:17     ` Ankur Arora
2025-07-10  0:59 ` [PATCH v5 03/14] perf bench mem: Move mem op parameters into a structure Ankur Arora
2025-07-15 20:06   ` Namhyung Kim
2025-07-10  0:59 ` [PATCH v5 04/14] perf bench mem: Pull out init/fini logic Ankur Arora
2025-07-15 20:09   ` Namhyung Kim
2025-07-10  0:59 ` [PATCH v5 05/14] perf bench mem: Switch from zalloc() to mmap() Ankur Arora
2025-07-15 20:09   ` Namhyung Kim
2025-07-10  0:59 ` [PATCH v5 06/14] perf bench mem: Allow mapping of hugepages Ankur Arora
2025-07-15 20:12   ` Namhyung Kim
2025-07-16  2:32     ` Ankur Arora
2025-07-10  0:59 ` [PATCH v5 07/14] perf bench mem: Allow chunking on a memory region Ankur Arora
2025-07-15 20:17   ` Namhyung Kim
2025-07-16  2:34     ` Ankur Arora
2025-07-10  0:59 ` [PATCH v5 08/14] perf bench mem: Refactor mem_options Ankur Arora
2025-07-15 20:18   ` Namhyung Kim
2025-07-10  0:59 ` [PATCH v5 09/14] perf bench mem: Add mmap() workloads Ankur Arora
2025-07-15 20:20   ` Namhyung Kim
2025-07-16  2:40     ` Ankur Arora
2025-07-10  0:59 ` [PATCH v5 10/14] x86/mm: Simplify clear_page_* Ankur Arora
2025-07-11 11:47   ` David Hildenbrand
2025-07-11 17:26     ` Ankur Arora
2025-07-11 19:03       ` David Hildenbrand
2025-07-11 19:24         ` Ankur Arora
2025-07-11 19:27           ` David Hildenbrand
2025-07-10  0:59 ` [PATCH v5 11/14] x86/clear_page: Introduce clear_pages() Ankur Arora
2025-07-10  0:59 ` [PATCH v5 12/14] mm: add config option for clearing page-extents Ankur Arora
2025-07-10  7:58   ` Andrew Morton
2025-07-10 16:31     ` Ankur Arora
2025-07-11 11:39   ` David Hildenbrand
2025-07-11 17:25     ` Ankur Arora
2025-07-11 19:14       ` David Hildenbrand
2025-07-11 19:35         ` Ankur Arora
2025-07-11 11:40   ` David Hildenbrand
2025-07-11 17:32     ` Ankur Arora
2025-07-11 19:26       ` David Hildenbrand
2025-07-11 19:42         ` Ankur Arora
2025-07-14 20:35         ` Ankur Arora
2025-07-15 20:59           ` David Hildenbrand
2025-07-10  0:59 ` [PATCH v5 13/14] mm: memory: support " Ankur Arora
2025-07-11 11:44   ` David Hildenbrand
2025-07-11 13:27     ` Raghavendra K T
2025-07-11 17:39     ` Ankur Arora
2025-07-15 22:08   ` David Hildenbrand
2025-07-16  3:19     ` Ankur Arora
2025-07-16  8:03       ` David Hildenbrand
2025-07-16 17:54         ` Ankur Arora
2025-07-10  0:59 ` [PATCH v5 14/14] x86/clear_pages: Support clearing of page-extents Ankur Arora

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).