* [PATCH v4 00/13] x86/mm: Add multi-page clearing
@ 2025-06-16  5:22 Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 01/13] perf bench mem: Remove repetition around time measurement Ankur Arora
                   ` (14 more replies)
  0 siblings, 15 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

This series adds multi-page clearing for hugepages, improving on the
current page-at-a-time approach in two ways:

 - amortizes the per-page setup cost over a larger extent
 - when using string instructions, exposes the real region size to the
   processor, which can use it as a hint to optimize for the full
   extent (see the sketch below). AMD Zen uarchs, for example, elide
   allocation of cachelines for regions larger than the L3 size.
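
In other words (a minimal sketch; clear_pages() is only introduced
later in the series, in patch 11):

	/* Page-at-a-time: the CPU never sees more than PAGE_SIZE at once. */
	for (i = 0; i < npages; i++)
		clear_page(addr + i * PAGE_SIZE);

	/* Multi-page: a single call exposes the full extent. */
	clear_pages(addr, npages);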

Demand faulting a 64GB region shows good performance improvements:

 $ perf bench mem map -p $page-size -f demand -s 64GB -l 5

                 mm/folio_zero_user    x86/folio_zero_user       change
                  (GB/s  +- %stdev)     (GB/s  +- %stdev)

  pg-sz=2MB       11.82  +- 0.67%        16.48  +-  0.30%       + 39.4%
  pg-sz=1GB       17.51  +- 1.19%        40.03  +-  7.26% [#]   +129.9%

[#] Only with preempt=full|lazy because cooperatively preempted models
need regular invocations of cond_resched(). This limits the extent
sizes that can be cleared as a unit.
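
Under the cooperative models the clearing loop has to be chunked so
that cond_resched() can run in between. Roughly (a sketch, with
max_chunk_pages standing in for whatever limit patch 13 actually
uses):

	while (npages) {
		unsigned int n = min(npages, max_chunk_pages);

		clear_pages(addr, n);
		addr += n * PAGE_SIZE;
		npages -= n;
		cond_resched();
	}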

Raghavendra also tested on AMD Genoa and that shows similar
improvements [1].

Series structure:

Patches 1-5, 8,
  "perf bench mem: Remove repetition around time measurement"
  "perf bench mem: Defer type munging of size to float"
  "perf bench mem: Move mem op parameters into a structure"
  "perf bench mem: Pull out init/fini logic"
  "perf bench mem: Switch from zalloc() to mmap()"
  "perf bench mem: Refactor mem_options"

refactor, and patches 6-7, 9
  "perf bench mem: Allow mapping of hugepages"
  "perf bench mem: Allow chunking on a memory region"
  "perf bench mem: Add mmap() workload"

add a few new perf bench mem workloads (chunking and mapping performance).

Patches 10-11,
  "x86/mm: Simplify clear_page_*"
  "x86/clear_page: Introduce clear_pages()"

inline the ERMS and REP_GOOD implementations used from clear_page()
and add clear_pages() to handle page extents.

And finally, patches 12-13 allow an arch override for folio_zero_user()
and provide the x86 implementation that does the actual multi-page
clearing.

  "mm: memory: allow arch override for folio_zero_user()"
  "x86/folio_zero_user: Add multi-page clearing"

Changelog:

v4:
 - add perf bench workloads to exercise mmap() populate/demand-fault (Ingo)
 - inline stosb etc. (PeterZ)
 - handle cooperative preemption models (Ingo)
 - interface and other cleanups all over (Ingo)

v3:
 - get rid of preemption dependency (TIF_ALLOW_RESCHED); this version
   was limited to preempt=full|lazy.
 - override folio_zero_user() (Linus)
 (https://lore.kernel.org/lkml/20250414034607.762653-1-ankur.a.arora@oracle.com/)

v2:
 - addressed review comments from peterz, tglx.
 - Removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
 - General code cleanup
 (https://lore.kernel.org/lkml/20230830184958.2333078-1-ankur.a.arora@oracle.com/)

Comments appreciated!

Also at:
  github.com/terminus/linux clear-pages.v4

[1] https://lore.kernel.org/lkml/0d6ba41c-0c90-4130-896a-26eabbd5bd24@amd.com/

Ankur Arora (13):
  perf bench mem: Remove repetition around time measurement
  perf bench mem: Defer type munging of size to float
  perf bench mem: Move mem op parameters into a structure
  perf bench mem: Pull out init/fini logic
  perf bench mem: Switch from zalloc() to mmap()
  perf bench mem: Allow mapping of hugepages
  perf bench mem: Allow chunking on a memory region
  perf bench mem: Refactor mem_options
  perf bench mem: Add mmap() workloads
  x86/mm: Simplify clear_page_*
  x86/clear_page: Introduce clear_pages()
  mm: memory: allow arch override for folio_zero_user()
  x86/folio_zero_user: Add multi-page clearing

 arch/x86/include/asm/page_32.h               |  18 +-
 arch/x86/include/asm/page_64.h               |  38 +-
 arch/x86/lib/clear_page_64.S                 |  39 +-
 arch/x86/mm/Makefile                         |   1 +
 arch/x86/mm/memory.c                         |  97 +++++
 mm/memory.c                                  |   5 +-
 tools/perf/bench/bench.h                     |   1 +
 tools/perf/bench/mem-functions.c             | 391 ++++++++++++++-----
 tools/perf/bench/mem-memcpy-arch.h           |   2 +-
 tools/perf/bench/mem-memcpy-x86-64-asm-def.h |   4 +
 tools/perf/bench/mem-memset-arch.h           |   2 +-
 tools/perf/bench/mem-memset-x86-64-asm-def.h |   4 +
 tools/perf/builtin-bench.c                   |   1 +
 13 files changed, 452 insertions(+), 151 deletions(-)
 create mode 100644 arch/x86/mm/memory.c

--
2.43.5



* [PATCH v4 01/13] perf bench mem: Remove repetition around time measurement
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
@ 2025-06-16  5:22 ` Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 02/13] perf bench mem: Defer type munging of size to float Ankur Arora
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

We have two copies of each mem benchmark: one that measures time in
cycles, the other via gettimeofday().

Unify them.
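
With that, every benchmark measures an interval the same way (sketch
based on the helpers added below):

	union bench_clock start, end, delta;

	clock_get(&start);
	/* ... benchmarked operation ... */
	clock_get(&end);
	delta = clock_diff(&start, &end);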

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c | 110 +++++++++++++------------------
 1 file changed, 46 insertions(+), 64 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index 19d45c377ac1..8599ed96ee1f 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -51,6 +51,11 @@ static const struct option options[] = {
 	OPT_END()
 };
 
+union bench_clock {
+	u64		cycles;
+	struct timeval	tv;
+};
+
 typedef void *(*memcpy_t)(void *, const void *, size_t);
 typedef void *(*memset_t)(void *, int, size_t);
 
@@ -91,6 +96,26 @@ static u64 get_cycles(void)
 	return clk;
 }
 
+static void clock_get(union bench_clock *t)
+{
+	if (use_cycles)
+		t->cycles = get_cycles();
+	else
+		BUG_ON(gettimeofday(&t->tv, NULL));
+}
+
+static union bench_clock clock_diff(union bench_clock *s, union bench_clock *e)
+{
+	union bench_clock t;
+
+	if (use_cycles)
+		t.cycles = e->cycles - s->cycles;
+	else
+		timersub(&e->tv, &s->tv, &t.tv);
+
+	return t;
+}
+
 static double timeval2double(struct timeval *ts)
 {
 	return (double)ts->tv_sec + (double)ts->tv_usec / (double)USEC_PER_SEC;
@@ -109,8 +134,7 @@ static double timeval2double(struct timeval *ts)
 
 struct bench_mem_info {
 	const struct function *functions;
-	u64 (*do_cycles)(const struct function *r, size_t size, void *src, void *dst);
-	double (*do_gettimeofday)(const struct function *r, size_t size, void *src, void *dst);
+	union bench_clock (*do_op)(const struct function *r, size_t size, void *src, void *dst);
 	const char *const *usage;
 	bool alloc_src;
 };
@@ -119,7 +143,7 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
 {
 	const struct function *r = &info->functions[r_idx];
 	double result_bps = 0.0;
-	u64 result_cycles = 0;
+	union bench_clock rt = { 0 };
 	void *src = NULL, *dst = zalloc(size);
 
 	printf("# function '%s' (%s)\n", r->name, r->desc);
@@ -136,25 +160,23 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
 	if (bench_format == BENCH_FORMAT_DEFAULT)
 		printf("# Copying %s bytes ...\n\n", size_str);
 
-	if (use_cycles) {
-		result_cycles = info->do_cycles(r, size, src, dst);
-	} else {
-		result_bps = info->do_gettimeofday(r, size, src, dst);
-	}
+	rt = info->do_op(r, size, src, dst);
 
 	switch (bench_format) {
 	case BENCH_FORMAT_DEFAULT:
 		if (use_cycles) {
-			printf(" %14lf cycles/byte\n", (double)result_cycles/size_total);
+			printf(" %14lf cycles/byte\n", (double)rt.cycles/size_total);
 		} else {
+			result_bps = size_total/timeval2double(&rt.tv);
 			print_bps(result_bps);
 		}
 		break;
 
 	case BENCH_FORMAT_SIMPLE:
 		if (use_cycles) {
-			printf("%lf\n", (double)result_cycles/size_total);
+			printf("%lf\n", (double)rt.cycles/size_total);
 		} else {
+			result_bps = size_total/timeval2double(&rt.tv);
 			printf("%lf\n", result_bps);
 		}
 		break;
@@ -235,38 +257,21 @@ static void memcpy_prefault(memcpy_t fn, size_t size, void *src, void *dst)
 	fn(dst, src, size);
 }
 
-static u64 do_memcpy_cycles(const struct function *r, size_t size, void *src, void *dst)
+static union bench_clock do_memcpy(const struct function *r, size_t size,
+				   void *src, void *dst)
 {
-	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	union bench_clock start, end;
 	memcpy_t fn = r->fn.memcpy;
 	int i;
 
 	memcpy_prefault(fn, size, src, dst);
 
-	cycle_start = get_cycles();
+	clock_get(&start);
 	for (i = 0; i < nr_loops; ++i)
 		fn(dst, src, size);
-	cycle_end = get_cycles();
+	clock_get(&end);
 
-	return cycle_end - cycle_start;
-}
-
-static double do_memcpy_gettimeofday(const struct function *r, size_t size, void *src, void *dst)
-{
-	struct timeval tv_start, tv_end, tv_diff;
-	memcpy_t fn = r->fn.memcpy;
-	int i;
-
-	memcpy_prefault(fn, size, src, dst);
-
-	BUG_ON(gettimeofday(&tv_start, NULL));
-	for (i = 0; i < nr_loops; ++i)
-		fn(dst, src, size);
-	BUG_ON(gettimeofday(&tv_end, NULL));
-
-	timersub(&tv_end, &tv_start, &tv_diff);
-
-	return (double)(((double)size * nr_loops) / timeval2double(&tv_diff));
+	return clock_diff(&start, &end);
 }
 
 struct function memcpy_functions[] = {
@@ -292,8 +297,7 @@ int bench_mem_memcpy(int argc, const char **argv)
 {
 	struct bench_mem_info info = {
 		.functions		= memcpy_functions,
-		.do_cycles		= do_memcpy_cycles,
-		.do_gettimeofday	= do_memcpy_gettimeofday,
+		.do_op			= do_memcpy,
 		.usage			= bench_mem_memcpy_usage,
 		.alloc_src              = true,
 	};
@@ -301,9 +305,10 @@ int bench_mem_memcpy(int argc, const char **argv)
 	return bench_mem_common(argc, argv, &info);
 }
 
-static u64 do_memset_cycles(const struct function *r, size_t size, void *src __maybe_unused, void *dst)
+static union bench_clock do_memset(const struct function *r, size_t size,
+				   void *src __maybe_unused, void *dst)
 {
-	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	union bench_clock start, end;
 	memset_t fn = r->fn.memset;
 	int i;
 
@@ -313,34 +318,12 @@ static u64 do_memset_cycles(const struct function *r, size_t size, void *src __m
 	 */
 	fn(dst, -1, size);
 
-	cycle_start = get_cycles();
+	clock_get(&start);
 	for (i = 0; i < nr_loops; ++i)
 		fn(dst, i, size);
-	cycle_end = get_cycles();
+	clock_get(&end);
 
-	return cycle_end - cycle_start;
-}
-
-static double do_memset_gettimeofday(const struct function *r, size_t size, void *src __maybe_unused, void *dst)
-{
-	struct timeval tv_start, tv_end, tv_diff;
-	memset_t fn = r->fn.memset;
-	int i;
-
-	/*
-	 * We prefault the freshly allocated memory range here,
-	 * to not measure page fault overhead:
-	 */
-	fn(dst, -1, size);
-
-	BUG_ON(gettimeofday(&tv_start, NULL));
-	for (i = 0; i < nr_loops; ++i)
-		fn(dst, i, size);
-	BUG_ON(gettimeofday(&tv_end, NULL));
-
-	timersub(&tv_end, &tv_start, &tv_diff);
-
-	return (double)(((double)size * nr_loops) / timeval2double(&tv_diff));
+	return clock_diff(&start, &end);
 }
 
 static const char * const bench_mem_memset_usage[] = {
@@ -366,8 +349,7 @@ int bench_mem_memset(int argc, const char **argv)
 {
 	struct bench_mem_info info = {
 		.functions		= memset_functions,
-		.do_cycles		= do_memset_cycles,
-		.do_gettimeofday	= do_memset_gettimeofday,
+		.do_op			= do_memset,
 		.usage			= bench_mem_memset_usage,
 	};
 
-- 
2.31.1




* [PATCH v4 02/13] perf bench mem: Defer type munging of size to float
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 01/13] perf bench mem: Remove repetition around time measurement Ankur Arora
@ 2025-06-16  5:22 ` Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 03/13] perf bench mem: Move mem op parameters into a structure Ankur Arora
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

Do type conversion to double at the point of use.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index 8599ed96ee1f..b8f020379197 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -139,7 +139,7 @@ struct bench_mem_info {
 	bool alloc_src;
 };
 
-static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t size, double size_total)
+static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t size, size_t size_total)
 {
 	const struct function *r = &info->functions[r_idx];
 	double result_bps = 0.0;
@@ -165,18 +165,18 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
 	switch (bench_format) {
 	case BENCH_FORMAT_DEFAULT:
 		if (use_cycles) {
-			printf(" %14lf cycles/byte\n", (double)rt.cycles/size_total);
+			printf(" %14lf cycles/byte\n", (double)rt.cycles/(double)size_total);
 		} else {
-			result_bps = size_total/timeval2double(&rt.tv);
+			result_bps = (double)size_total/timeval2double(&rt.tv);
 			print_bps(result_bps);
 		}
 		break;
 
 	case BENCH_FORMAT_SIMPLE:
 		if (use_cycles) {
-			printf("%lf\n", (double)rt.cycles/size_total);
+			printf("%lf\n", (double)rt.cycles/(double)size_total);
 		} else {
-			result_bps = size_total/timeval2double(&rt.tv);
+			result_bps = (double)size_total/timeval2double(&rt.tv);
 			printf("%lf\n", result_bps);
 		}
 		break;
@@ -199,7 +199,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 {
 	int i;
 	size_t size;
-	double size_total;
+	size_t size_total;
 
 	argc = parse_options(argc, argv, options, info->usage, 0);
 
@@ -212,7 +212,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 	}
 
 	size = (size_t)perf_atoll((char *)size_str);
-	size_total = (double)size * nr_loops;
+	size_total = (size_t)size * nr_loops;
 
 	if ((s64)size <= 0) {
 		fprintf(stderr, "Invalid size:%s\n", size_str);
-- 
2.31.1




* [PATCH v4 03/13] perf bench mem: Move mem op parameters into a structure
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 01/13] perf bench mem: Remove repetition around time measurement Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 02/13] perf bench mem: Defer type munging of size to float Ankur Arora
@ 2025-06-16  5:22 ` Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 04/13] perf bench mem: Pull out init/fini logic Ankur Arora
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

Move benchmark function parameters inside struct bench_params.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c | 63 +++++++++++++++++---------------
 1 file changed, 34 insertions(+), 29 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index b8f020379197..fb17d36a6f6c 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -30,7 +30,7 @@
 
 static const char	*size_str	= "1MB";
 static const char	*function_str	= "all";
-static int		nr_loops	= 1;
+static unsigned int	nr_loops	= 1;
 static bool		use_cycles;
 static int		cycles_fd;
 
@@ -42,7 +42,7 @@ static const struct option options[] = {
 	OPT_STRING('f', "function", &function_str, "all",
 		    "Specify the function to run, \"all\" runs all available functions, \"help\" lists them"),
 
-	OPT_INTEGER('l', "nr_loops", &nr_loops,
+	OPT_UINTEGER('l', "nr_loops", &nr_loops,
 		    "Specify the number of loops to run. (default: 1)"),
 
 	OPT_BOOLEAN('c', "cycles", &use_cycles,
@@ -56,6 +56,12 @@ union bench_clock {
 	struct timeval	tv;
 };
 
+struct bench_params {
+	size_t		size;
+	size_t		size_total;
+	unsigned int	nr_loops;
+};
+
 typedef void *(*memcpy_t)(void *, const void *, size_t);
 typedef void *(*memset_t)(void *, int, size_t);
 
@@ -134,17 +140,19 @@ static double timeval2double(struct timeval *ts)
 
 struct bench_mem_info {
 	const struct function *functions;
-	union bench_clock (*do_op)(const struct function *r, size_t size, void *src, void *dst);
+	union bench_clock (*do_op)(const struct function *r, struct bench_params *p,
+				   void *src, void *dst);
 	const char *const *usage;
 	bool alloc_src;
 };
 
-static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t size, size_t size_total)
+static void __bench_mem_function(struct bench_mem_info *info, struct bench_params *p,
+				 int r_idx)
 {
 	const struct function *r = &info->functions[r_idx];
 	double result_bps = 0.0;
 	union bench_clock rt = { 0 };
-	void *src = NULL, *dst = zalloc(size);
+	void *src = NULL, *dst = zalloc(p->size);
 
 	printf("# function '%s' (%s)\n", r->name, r->desc);
 
@@ -152,7 +160,7 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
 		goto out_alloc_failed;
 
 	if (info->alloc_src) {
-		src = zalloc(size);
+		src = zalloc(p->size);
 		if (src == NULL)
 			goto out_alloc_failed;
 	}
@@ -160,23 +168,23 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
 	if (bench_format == BENCH_FORMAT_DEFAULT)
 		printf("# Copying %s bytes ...\n\n", size_str);
 
-	rt = info->do_op(r, size, src, dst);
+	rt = info->do_op(r, p, src, dst);
 
 	switch (bench_format) {
 	case BENCH_FORMAT_DEFAULT:
 		if (use_cycles) {
-			printf(" %14lf cycles/byte\n", (double)rt.cycles/(double)size_total);
+			printf(" %14lf cycles/byte\n", (double)rt.cycles/(double)p->size_total);
 		} else {
-			result_bps = (double)size_total/timeval2double(&rt.tv);
+			result_bps = (double)p->size_total/timeval2double(&rt.tv);
 			print_bps(result_bps);
 		}
 		break;
 
 	case BENCH_FORMAT_SIMPLE:
 		if (use_cycles) {
-			printf("%lf\n", (double)rt.cycles/(double)size_total);
+			printf("%lf\n", (double)rt.cycles/(double)p->size_total);
 		} else {
-			result_bps = (double)size_total/timeval2double(&rt.tv);
+			result_bps = (double)p->size_total/timeval2double(&rt.tv);
 			printf("%lf\n", result_bps);
 		}
 		break;
@@ -198,8 +206,7 @@ static void __bench_mem_function(struct bench_mem_info *info, int r_idx, size_t
 static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *info)
 {
 	int i;
-	size_t size;
-	size_t size_total;
+	struct bench_params p = { 0 };
 
 	argc = parse_options(argc, argv, options, info->usage, 0);
 
@@ -211,17 +218,17 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 		}
 	}
 
-	size = (size_t)perf_atoll((char *)size_str);
-	size_total = (size_t)size * nr_loops;
-
-	if ((s64)size <= 0) {
+	p.nr_loops = nr_loops;
+	p.size = (size_t)perf_atoll((char *)size_str);
+	if ((s64)p.size <= 0) {
 		fprintf(stderr, "Invalid size:%s\n", size_str);
 		return 1;
 	}
+	p.size_total = (size_t)p.size * p.nr_loops;
 
 	if (!strncmp(function_str, "all", 3)) {
 		for (i = 0; info->functions[i].name; i++)
-			__bench_mem_function(info, i, size, size_total);
+			__bench_mem_function(info, &p, i);
 		return 0;
 	}
 
@@ -240,7 +247,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 		return 1;
 	}
 
-	__bench_mem_function(info, i, size, size_total);
+	__bench_mem_function(info, &p, i);
 
 	return 0;
 }
@@ -257,18 +264,17 @@ static void memcpy_prefault(memcpy_t fn, size_t size, void *src, void *dst)
 	fn(dst, src, size);
 }
 
-static union bench_clock do_memcpy(const struct function *r, size_t size,
+static union bench_clock do_memcpy(const struct function *r, struct bench_params *p,
 				   void *src, void *dst)
 {
 	union bench_clock start, end;
 	memcpy_t fn = r->fn.memcpy;
-	int i;
 
-	memcpy_prefault(fn, size, src, dst);
+	memcpy_prefault(fn, p->size, src, dst);
 
 	clock_get(&start);
-	for (i = 0; i < nr_loops; ++i)
-		fn(dst, src, size);
+	for (unsigned int i = 0; i < p->nr_loops; ++i)
+		fn(dst, src, p->size);
 	clock_get(&end);
 
 	return clock_diff(&start, &end);
@@ -305,22 +311,21 @@ int bench_mem_memcpy(int argc, const char **argv)
 	return bench_mem_common(argc, argv, &info);
 }
 
-static union bench_clock do_memset(const struct function *r, size_t size,
+static union bench_clock do_memset(const struct function *r, struct bench_params *p,
 				   void *src __maybe_unused, void *dst)
 {
 	union bench_clock start, end;
 	memset_t fn = r->fn.memset;
-	int i;
 
 	/*
 	 * We prefault the freshly allocated memory range here,
 	 * to not measure page fault overhead:
 	 */
-	fn(dst, -1, size);
+	fn(dst, -1, p->size);
 
 	clock_get(&start);
-	for (i = 0; i < nr_loops; ++i)
-		fn(dst, i, size);
+	for (unsigned int i = 0; i < p->nr_loops; ++i)
+		fn(dst, i, p->size);
 	clock_get(&end);
 
 	return clock_diff(&start, &end);
-- 
2.31.1




* [PATCH v4 04/13] perf bench mem: Pull out init/fini logic
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
                   ` (2 preceding siblings ...)
  2025-06-16  5:22 ` [PATCH v4 03/13] perf bench mem: Move mem op parameters into a structure Ankur Arora
@ 2025-06-16  5:22 ` Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 05/13] perf bench mem: Switch from zalloc() to mmap() Ankur Arora
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

Pull the buffer allocation/free logic out into per-function init/fini
callbacks. No functional change.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c             | 103 +++++++++++++------
 tools/perf/bench/mem-memcpy-arch.h           |   2 +-
 tools/perf/bench/mem-memcpy-x86-64-asm-def.h |   4 +
 tools/perf/bench/mem-memset-arch.h           |   2 +-
 tools/perf/bench/mem-memset-x86-64-asm-def.h |   4 +
 5 files changed, 81 insertions(+), 34 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index fb17d36a6f6c..06d3ee6f5d69 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -62,15 +62,31 @@ struct bench_params {
 	unsigned int	nr_loops;
 };
 
+struct bench_mem_info {
+	const struct function *functions;
+	int (*do_op)(const struct function *r, struct bench_params *p,
+		     void *src, void *dst, union bench_clock *rt);
+	const char *const *usage;
+	bool alloc_src;
+};
+
+typedef bool (*mem_init_t)(struct bench_mem_info *, struct bench_params *,
+			   void **, void **);
+typedef void (*mem_fini_t)(struct bench_mem_info *, struct bench_params *,
+			   void **, void **);
 typedef void *(*memcpy_t)(void *, const void *, size_t);
 typedef void *(*memset_t)(void *, int, size_t);
 
 struct function {
 	const char *name;
 	const char *desc;
-	union {
-		memcpy_t memcpy;
-		memset_t memset;
+	struct {
+		mem_init_t init;
+		mem_fini_t fini;
+		union {
+			memcpy_t memcpy;
+			memset_t memset;
+		};
 	} fn;
 };
 
@@ -138,37 +154,24 @@ static double timeval2double(struct timeval *ts)
 			printf(" %14lf GB/sec\n", x / K / K / K);	\
 	} while (0)
 
-struct bench_mem_info {
-	const struct function *functions;
-	union bench_clock (*do_op)(const struct function *r, struct bench_params *p,
-				   void *src, void *dst);
-	const char *const *usage;
-	bool alloc_src;
-};
-
 static void __bench_mem_function(struct bench_mem_info *info, struct bench_params *p,
 				 int r_idx)
 {
 	const struct function *r = &info->functions[r_idx];
 	double result_bps = 0.0;
 	union bench_clock rt = { 0 };
-	void *src = NULL, *dst = zalloc(p->size);
+	void *src = NULL, *dst = NULL;
 
 	printf("# function '%s' (%s)\n", r->name, r->desc);
 
-	if (dst == NULL)
-		goto out_alloc_failed;
-
-	if (info->alloc_src) {
-		src = zalloc(p->size);
-		if (src == NULL)
-			goto out_alloc_failed;
-	}
+	if (r->fn.init && r->fn.init(info, p, &src, &dst))
+		goto out_init_failed;
 
 	if (bench_format == BENCH_FORMAT_DEFAULT)
 		printf("# Copying %s bytes ...\n\n", size_str);
 
-	rt = info->do_op(r, p, src, dst);
+	if (info->do_op(r, p, src, dst, &rt))
+		goto out_test_failed;
 
 	switch (bench_format) {
 	case BENCH_FORMAT_DEFAULT:
@@ -194,11 +197,11 @@ static void __bench_mem_function(struct bench_mem_info *info, struct bench_param
 		break;
 	}
 
+out_test_failed:
 out_free:
-	free(src);
-	free(dst);
+	if (r->fn.fini) r->fn.fini(info, p, &src, &dst);
 	return;
-out_alloc_failed:
+out_init_failed:
 	printf("# Memory allocation failed - maybe size (%s) is too large?\n", size_str);
 	goto out_free;
 }
@@ -264,8 +267,8 @@ static void memcpy_prefault(memcpy_t fn, size_t size, void *src, void *dst)
 	fn(dst, src, size);
 }
 
-static union bench_clock do_memcpy(const struct function *r, struct bench_params *p,
-				   void *src, void *dst)
+static int do_memcpy(const struct function *r, struct bench_params *p,
+		     void *src, void *dst, union bench_clock *rt)
 {
 	union bench_clock start, end;
 	memcpy_t fn = r->fn.memcpy;
@@ -277,16 +280,47 @@ static union bench_clock do_memcpy(const struct function *r, struct bench_params
 		fn(dst, src, p->size);
 	clock_get(&end);
 
-	return clock_diff(&start, &end);
+	*rt = clock_diff(&start, &end);
+
+	return 0;
+}
+
+static bool mem_alloc(struct bench_mem_info *info, struct bench_params *p,
+		      void **src, void **dst)
+{
+	bool failed;
+
+	*dst = zalloc(p->size);
+	failed = *dst == NULL;
+
+	if (info->alloc_src) {
+		*src = zalloc(p->size);
+		failed = failed || *src == NULL;
+	}
+
+	return failed;
+}
+
+static void mem_free(struct bench_mem_info *info __maybe_unused,
+		     struct bench_params *p __maybe_unused,
+		     void **src, void **dst)
+{
+	free(*dst);
+	free(*src);
+
+	*dst = *src = NULL;
 }
 
 struct function memcpy_functions[] = {
 	{ .name		= "default",
 	  .desc		= "Default memcpy() provided by glibc",
+	  .fn.init	= mem_alloc,
+	  .fn.fini	= mem_free,
 	  .fn.memcpy	= memcpy },
 
 #ifdef HAVE_ARCH_X86_64_SUPPORT
-# define MEMCPY_FN(_fn, _name, _desc) {.name = _name, .desc = _desc, .fn.memcpy = _fn},
+# define MEMCPY_FN(_fn, _init, _fini, _name, _desc)	\
+	{.name = _name, .desc = _desc, .fn.memcpy = _fn, .fn.init = _init, .fn.fini = _fini },
 # include "mem-memcpy-x86-64-asm-def.h"
 # undef MEMCPY_FN
 #endif
@@ -311,8 +345,8 @@ int bench_mem_memcpy(int argc, const char **argv)
 	return bench_mem_common(argc, argv, &info);
 }
 
-static union bench_clock do_memset(const struct function *r, struct bench_params *p,
-				   void *src __maybe_unused, void *dst)
+static int do_memset(const struct function *r, struct bench_params *p,
+		     void *src __maybe_unused, void *dst, union bench_clock *rt)
 {
 	union bench_clock start, end;
 	memset_t fn = r->fn.memset;
@@ -328,7 +362,9 @@ static union bench_clock do_memset(const struct function *r, struct bench_params
 		fn(dst, i, p->size);
 	clock_get(&end);
 
-	return clock_diff(&start, &end);
+	*rt = clock_diff(&start, &end);
+
+	return 0;
 }
 
 static const char * const bench_mem_memset_usage[] = {
@@ -339,10 +375,13 @@ static const char * const bench_mem_memset_usage[] = {
 static const struct function memset_functions[] = {
 	{ .name		= "default",
 	  .desc		= "Default memset() provided by glibc",
+	  .fn.init	= mem_alloc,
+	  .fn.fini	= mem_free,
 	  .fn.memset	= memset },
 
 #ifdef HAVE_ARCH_X86_64_SUPPORT
-# define MEMSET_FN(_fn, _name, _desc) { .name = _name, .desc = _desc, .fn.memset = _fn },
+# define MEMSET_FN(_fn, _init, _fini, _name, _desc) \
+	{.name = _name, .desc = _desc, .fn.memset = _fn, .fn.init = _init, .fn.fini = _fini },
 # include "mem-memset-x86-64-asm-def.h"
 # undef MEMSET_FN
 #endif
diff --git a/tools/perf/bench/mem-memcpy-arch.h b/tools/perf/bench/mem-memcpy-arch.h
index 5bcaec5601a8..852e48cfd8fe 100644
--- a/tools/perf/bench/mem-memcpy-arch.h
+++ b/tools/perf/bench/mem-memcpy-arch.h
@@ -2,7 +2,7 @@
 
 #ifdef HAVE_ARCH_X86_64_SUPPORT
 
-#define MEMCPY_FN(fn, name, desc)		\
+#define MEMCPY_FN(fn, init, fini, name, desc)		\
 	void *fn(void *, const void *, size_t);
 
 #include "mem-memcpy-x86-64-asm-def.h"
diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm-def.h b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
index 6188e19d3129..f43038f4448b 100644
--- a/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
+++ b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
@@ -1,9 +1,13 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 MEMCPY_FN(memcpy_orig,
+	mem_alloc,
+	mem_free,
 	"x86-64-unrolled",
 	"unrolled memcpy() in arch/x86/lib/memcpy_64.S")
 
 MEMCPY_FN(__memcpy,
+	mem_alloc,
+	mem_free,
 	"x86-64-movsq",
 	"movsq-based memcpy() in arch/x86/lib/memcpy_64.S")
diff --git a/tools/perf/bench/mem-memset-arch.h b/tools/perf/bench/mem-memset-arch.h
index 53f45482663f..278c5da12d63 100644
--- a/tools/perf/bench/mem-memset-arch.h
+++ b/tools/perf/bench/mem-memset-arch.h
@@ -2,7 +2,7 @@
 
 #ifdef HAVE_ARCH_X86_64_SUPPORT
 
-#define MEMSET_FN(fn, name, desc)		\
+#define MEMSET_FN(fn, init, fini, name, desc)	\
 	void *fn(void *, int, size_t);
 
 #include "mem-memset-x86-64-asm-def.h"
diff --git a/tools/perf/bench/mem-memset-x86-64-asm-def.h b/tools/perf/bench/mem-memset-x86-64-asm-def.h
index 247c72fdfb9d..80ad1b7ea770 100644
--- a/tools/perf/bench/mem-memset-x86-64-asm-def.h
+++ b/tools/perf/bench/mem-memset-x86-64-asm-def.h
@@ -1,9 +1,13 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 MEMSET_FN(memset_orig,
+	mem_alloc,
+	mem_free,
 	"x86-64-unrolled",
 	"unrolled memset() in arch/x86/lib/memset_64.S")
 
 MEMSET_FN(__memset,
+	mem_alloc,
+	mem_free,
 	"x86-64-stosq",
 	"movsq-based memset() in arch/x86/lib/memset_64.S")
-- 
2.31.1




* [PATCH v4 05/13] perf bench mem: Switch from zalloc() to mmap()
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
                   ` (3 preceding siblings ...)
  2025-06-16  5:22 ` [PATCH v4 04/13] perf bench mem: Pull out init/fini logic Ankur Arora
@ 2025-06-16  5:22 ` Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 06/13] perf bench mem: Allow mapping of hugepages Ankur Arora
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

Using mmap() ensures that the buffer is always aligned at a fixed
boundary. Switch to that to remove one source of variability.

Since we always want to read/write from the allocated buffers, map
with pagetables pre-populated.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index 06d3ee6f5d69..914f9048d982 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -22,9 +22,9 @@
 #include <string.h>
 #include <unistd.h>
 #include <sys/time.h>
+#include <sys/mman.h>
 #include <errno.h>
 #include <linux/time64.h>
-#include <linux/zalloc.h>
 
 #define K 1024
 
@@ -285,16 +285,33 @@ static int do_memcpy(const struct function *r, struct bench_params *p,
 	return 0;
 }
 
+static void *bench_mmap(size_t size, bool populate)
+{
+	void *p;
+	int extra = populate ? MAP_POPULATE : 0;
+
+	p = mmap(NULL, size, PROT_READ|PROT_WRITE,
+		 extra | MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+
+	return p == MAP_FAILED ? NULL : p;
+}
+
+static void bench_munmap(void *p, size_t size)
+{
+	if (p)
+		munmap(p, size);
+}
+
 static bool mem_alloc(struct bench_mem_info *info, struct bench_params *p,
 		      void **src, void **dst)
 {
 	bool failed;
 
-	*dst = zalloc(p->size);
+	*dst = bench_mmap(p->size, true);
 	failed = *dst == NULL;
 
 	if (info->alloc_src) {
-		*src = zalloc(p->size);
+		*src = bench_mmap(p->size, true);
 		failed = failed || *src == NULL;
 	}
 
@@ -305,8 +322,8 @@ static void mem_free(struct bench_mem_info *info __maybe_unused,
 		     struct bench_params *p __maybe_unused,
 		     void **src, void **dst)
 {
-	free(*dst);
-	free(*src);
+	bench_munmap(*dst, p->size);
+	bench_munmap(*src, p->size);
 
 	*dst = *src = NULL;
 }
-- 
2.31.1




* [PATCH v4 06/13] perf bench mem: Allow mapping of hugepages
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
                   ` (4 preceding siblings ...)
  2025-06-16  5:22 ` [PATCH v4 05/13] perf bench mem: Switch from zalloc() to mmap() Ankur Arora
@ 2025-06-16  5:22 ` Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 07/13] perf bench mem: Allow chunking on a memory region Ankur Arora
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

Page sizes that can be selected: 4KB, 2MB, 1GB.

Both the reservation of hugepages and the node they are allocated
from are expected to be handled by the user.
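
For example, to use 2MB pages (assuming hugepages have been reserved
up front; the sysfs knob below is one way to do that):

  $ echo 64 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
  $ perf bench mem memset -p 2MB -s 1GB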

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c | 33 ++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index 914f9048d982..e4d713587d45 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -25,11 +25,17 @@
 #include <sys/mman.h>
 #include <errno.h>
 #include <linux/time64.h>
+#include <linux/log2.h>
 
 #define K 1024
 
+#define PAGE_SHIFT_4KB		12
+#define PAGE_SHIFT_2MB		21
+#define PAGE_SHIFT_1GB		30
+
 static const char	*size_str	= "1MB";
 static const char	*function_str	= "all";
+static const char	*page_size_str	= "4KB";
 static unsigned int	nr_loops	= 1;
 static bool		use_cycles;
 static int		cycles_fd;
@@ -39,6 +45,10 @@ static const struct option options[] = {
 		    "Specify the size of the memory buffers. "
 		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
 
+	OPT_STRING('p', "page", &page_size_str, "4KB",
+		    "Specify page-size for mapping memory buffers. "
+		    "Available sizes: 4KB, 2MB, 1GB (case insensitive)"),
+
 	OPT_STRING('f', "function", &function_str, "all",
 		    "Specify the function to run, \"all\" runs all available functions, \"help\" lists them"),
 
@@ -60,6 +70,7 @@ struct bench_params {
 	size_t		size;
 	size_t		size_total;
 	unsigned int	nr_loops;
+	unsigned int	page_shift;
 };
 
 struct bench_mem_info {
@@ -202,7 +213,8 @@ static void __bench_mem_function(struct bench_mem_info *info, struct bench_param
 	if (r->fn.fini) r->fn.fini(info, p, &src, &dst);
 	return;
 out_init_failed:
-	printf("# Memory allocation failed - maybe size (%s) is too large?\n", size_str);
+	printf("# Memory allocation failed - maybe size (%s) %s?\n", size_str,
+			p->page_shift != PAGE_SHIFT_4KB ? "has insufficient hugepages" : "is too large");
 	goto out_free;
 }
 
@@ -210,6 +222,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 {
 	int i;
 	struct bench_params p = { 0 };
+	unsigned int page_size;
 
 	argc = parse_options(argc, argv, options, info->usage, 0);
 
@@ -229,6 +242,15 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 	}
 	p.size_total = (size_t)p.size * p.nr_loops;
 
+	page_size = (unsigned int)perf_atoll((char *)page_size_str);
+	if (page_size != (1 << PAGE_SHIFT_4KB) &&
+	    page_size != (1 << PAGE_SHIFT_2MB) &&
+	    page_size != (1 << PAGE_SHIFT_1GB)) {
+		fprintf(stderr, "Invalid page-size:%s\n", page_size_str);
+		return 1;
+	}
+	p.page_shift = ilog2(page_size);
+
 	if (!strncmp(function_str, "all", 3)) {
 		for (i = 0; info->functions[i].name; i++)
 			__bench_mem_function(info, &p, i);
@@ -285,11 +307,14 @@ static int do_memcpy(const struct function *r, struct bench_params *p,
 	return 0;
 }
 
-static void *bench_mmap(size_t size, bool populate)
+static void *bench_mmap(size_t size, bool populate, unsigned int page_shift)
 {
 	void *p;
 	int extra = populate ? MAP_POPULATE : 0;
 
+	if (page_shift != PAGE_SHIFT_4KB)
+		extra |= MAP_HUGETLB | (page_shift << MAP_HUGE_SHIFT);
+
 	p = mmap(NULL, size, PROT_READ|PROT_WRITE,
 		 extra | MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
 
@@ -307,11 +332,11 @@ static bool mem_alloc(struct bench_mem_info *info, struct bench_params *p,
 {
 	bool failed;
 
-	*dst = bench_mmap(p->size, true);
+	*dst = bench_mmap(p->size, true, p->page_shift);
 	failed = *dst == NULL;
 
 	if (info->alloc_src) {
-		*src = bench_mmap(p->size, true);
+		*src = bench_mmap(p->size, true, p->page_shift);
 		failed = failed || *src == NULL;
 	}
 
-- 
2.31.1




* [PATCH v4 07/13] perf bench mem: Allow chunking on a memory region
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
                   ` (5 preceding siblings ...)
  2025-06-16  5:22 ` [PATCH v4 06/13] perf bench mem: Allow mapping of hugepages Ankur Arora
@ 2025-06-16  5:22 ` Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 08/13] perf bench mem: Refactor mem_options Ankur Arora
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

There can be a significant gap in memset/memcpy performance depending
on the size of the region being operated on.

With chunk-size=4kb:

  $ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

  $ perf bench mem memset -p 4kb -k 4kb -s 4gb -l 10 -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4gb bytes ...

      13.011655 GB/sec

With chunk-size=1gb:

  $ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

  $ perf bench mem memset -p 4kb -k 1gb -s 4gb -l 10 -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4gb bytes ...

      21.936355 GB/sec

So, allow the user to specify the chunk-size.

The default value is identical to the total size of the region, which
preserves current behaviour.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index e4d713587d45..412d18f2cb2e 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -36,6 +36,7 @@
 static const char	*size_str	= "1MB";
 static const char	*function_str	= "all";
 static const char	*page_size_str	= "4KB";
+static const char	*chunk_size_str	= "0";
 static unsigned int	nr_loops	= 1;
 static bool		use_cycles;
 static int		cycles_fd;
@@ -49,6 +50,10 @@ static const struct option options[] = {
 		    "Specify page-size for mapping memory buffers. "
 		    "Available sizes: 4KB, 2MB, 1GB (case insensitive)"),
 
+	OPT_STRING('k', "chunk", &chunk_size_str, "0",
+		    "Specify the chunk-size for each invocation. "
+		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
+
 	OPT_STRING('f', "function", &function_str, "all",
 		    "Specify the function to run, \"all\" runs all available functions, \"help\" lists them"),
 
@@ -69,6 +74,7 @@ union bench_clock {
 struct bench_params {
 	size_t		size;
 	size_t		size_total;
+	size_t		chunk_size;
 	unsigned int	nr_loops;
 	unsigned int	page_shift;
 };
@@ -242,6 +248,14 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 	}
 	p.size_total = (size_t)p.size * p.nr_loops;
 
+	p.chunk_size = (size_t)perf_atoll((char *)chunk_size_str);
+	if ((s64)p.chunk_size < 0 || (s64)p.chunk_size > (s64)p.size) {
+		fprintf(stderr, "Invalid chunk_size:%s\n", chunk_size_str);
+		return 1;
+	}
+	if (!p.chunk_size)
+		p.chunk_size = p.size;
+
 	page_size = (unsigned int)perf_atoll((char *)page_size_str);
 	if (page_size != (1 << PAGE_SHIFT_4KB) &&
 	    page_size != (1 << PAGE_SHIFT_2MB) &&
@@ -299,7 +313,8 @@ static int do_memcpy(const struct function *r, struct bench_params *p,
 
 	clock_get(&start);
 	for (unsigned int i = 0; i < p->nr_loops; ++i)
-		fn(dst, src, p->size);
+		for (size_t off = 0; off < p->size; off += p->chunk_size)
+			fn(dst + off, src + off, min(p->chunk_size, p->size - off));
 	clock_get(&end);
 
 	*rt = clock_diff(&start, &end);
@@ -401,7 +416,8 @@ static int do_memset(const struct function *r, struct bench_params *p,
 
 	clock_get(&start);
 	for (unsigned int i = 0; i < p->nr_loops; ++i)
-		fn(dst, i, p->size);
+		for (size_t off = 0; off < p->size; off += p->chunk_size)
+			fn(dst + off, i, min(p->chunk_size, p->size - off));
 	clock_get(&end);
 
 	*rt = clock_diff(&start, &end);
-- 
2.31.1




* [PATCH v4 08/13] perf bench mem: Refactor mem_options
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
                   ` (6 preceding siblings ...)
  2025-06-16  5:22 ` [PATCH v4 07/13] perf bench mem: Allow chunking on a memory region Ankur Arora
@ 2025-06-16  5:22 ` Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 09/13] perf bench mem: Add mmap() workloads Ankur Arora
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

Split the mem benchmark options into common options and ones specific
to memset/memcpy.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/mem-functions.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index 412d18f2cb2e..8a37da149327 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -41,7 +41,7 @@ static unsigned int	nr_loops	= 1;
 static bool		use_cycles;
 static int		cycles_fd;
 
-static const struct option options[] = {
+static const struct option bench_common_options[] = {
 	OPT_STRING('s', "size", &size_str, "1MB",
 		    "Specify the size of the memory buffers. "
 		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
@@ -50,10 +50,6 @@ static const struct option options[] = {
 		    "Specify page-size for mapping memory buffers. "
 		    "Available sizes: 4KB, 2MB, 1GB (case insensitive)"),
 
-	OPT_STRING('k', "chunk", &chunk_size_str, "0",
-		    "Specify the chunk-size for each invocation. "
-		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
-
 	OPT_STRING('f', "function", &function_str, "all",
 		    "Specify the function to run, \"all\" runs all available functions, \"help\" lists them"),
 
@@ -66,6 +62,14 @@ static const struct option options[] = {
 	OPT_END()
 };
 
+static const struct option bench_mem_options[] = {
+	OPT_STRING('k', "chunk", &chunk_size_str, "0",
+		    "Specify the chunk-size for each invocation. "
+		    "Available units: B, KB, MB, GB and TB (case insensitive)"),
+	OPT_PARENT(bench_common_options),
+	OPT_END()
+};
+
 union bench_clock {
 	u64		cycles;
 	struct timeval	tv;
@@ -84,6 +88,7 @@ struct bench_mem_info {
 	int (*do_op)(const struct function *r, struct bench_params *p,
 		     void *src, void *dst, union bench_clock *rt);
 	const char *const *usage;
+	const struct option *options;
 	bool alloc_src;
 };
 
@@ -230,7 +235,7 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 	struct bench_params p = { 0 };
 	unsigned int page_size;
 
-	argc = parse_options(argc, argv, options, info->usage, 0);
+	argc = parse_options(argc, argv, info->options, info->usage, 0);
 
 	if (use_cycles) {
 		i = init_cycles();
@@ -396,6 +401,7 @@ int bench_mem_memcpy(int argc, const char **argv)
 		.functions		= memcpy_functions,
 		.do_op			= do_memcpy,
 		.usage			= bench_mem_memcpy_usage,
+		.options		= bench_mem_options,
 		.alloc_src              = true,
 	};
 
@@ -453,6 +459,7 @@ int bench_mem_memset(int argc, const char **argv)
 		.functions		= memset_functions,
 		.do_op			= do_memset,
 		.usage			= bench_mem_memset_usage,
+		.options		= bench_mem_options,
 	};
 
 	return bench_mem_common(argc, argv, &info);
-- 
2.31.1




* [PATCH v4 09/13] perf bench mem: Add mmap() workloads
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
                   ` (7 preceding siblings ...)
  2025-06-16  5:22 ` [PATCH v4 08/13] perf bench mem: Refactor mem_options Ankur Arora
@ 2025-06-16  5:22 ` Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 10/13] x86/mm: Simplify clear_page_* Ankur Arora
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

Add two mmap() workloads: one that eagerly populates a region and
another that demand faults it in.

The intent is to probe the memory subsystem costs incurred
by mmap().

  $ perf bench mem map -s 4gb -p 4kb -l 10 -f populate
  # Running 'mem/map' benchmark:
  # function 'populate' (Eagerly populated map)
  # Copying 4gb bytes ...

       1.811691 GB/sec

  $ perf bench mem map -s 4gb -p 2mb -l 10 -f populate
  # Running 'mem/map' benchmark:
  # function 'populate' (Eagerly populated map)
  # Copying 4gb bytes ...

      12.272017 GB/sec

  $ perf bench mem map -s 4gb -p 1gb -l 10 -f populate
  # Running 'mem/map' benchmark:
  # function 'populate' (Eagerly populated map)
  # Copying 4gb bytes ...

      17.085927 GB/sec
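
The demand-fault variant is invoked the same way; output elided here
since the numbers vary with page size and kernel preemption model:

  $ perf bench mem map -s 4gb -p 2mb -l 10 -f demand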

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 tools/perf/bench/bench.h         |  1 +
 tools/perf/bench/mem-functions.c | 96 ++++++++++++++++++++++++++++++++
 tools/perf/builtin-bench.c       |  1 +
 3 files changed, 98 insertions(+)

diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index 9f736423af53..46484bb0eefb 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -28,6 +28,7 @@ int bench_syscall_fork(int argc, const char **argv);
 int bench_syscall_execve(int argc, const char **argv);
 int bench_mem_memcpy(int argc, const char **argv);
 int bench_mem_memset(int argc, const char **argv);
+int bench_mem_map(int argc, const char **argv);
 int bench_mem_find_bit(int argc, const char **argv);
 int bench_futex_hash(int argc, const char **argv);
 int bench_futex_wake(int argc, const char **argv);
diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c
index 8a37da149327..ea62e3583a70 100644
--- a/tools/perf/bench/mem-functions.c
+++ b/tools/perf/bench/mem-functions.c
@@ -40,6 +40,7 @@ static const char	*chunk_size_str	= "0";
 static unsigned int	nr_loops	= 1;
 static bool		use_cycles;
 static int		cycles_fd;
+static unsigned int	seed;
 
 static const struct option bench_common_options[] = {
 	OPT_STRING('s', "size", &size_str, "1MB",
@@ -81,6 +82,7 @@ struct bench_params {
 	size_t		chunk_size;
 	unsigned int	nr_loops;
 	unsigned int	page_shift;
+	unsigned int	seed;
 };
 
 struct bench_mem_info {
@@ -98,6 +100,7 @@ typedef void (*mem_fini_t)(struct bench_mem_info *, struct bench_params *,
 			   void **, void **);
 typedef void *(*memcpy_t)(void *, const void *, size_t);
 typedef void *(*memset_t)(void *, int, size_t);
+typedef void (*map_op_t)(void *, size_t, unsigned int, bool);
 
 struct function {
 	const char *name;
@@ -108,6 +111,7 @@ struct function {
 		union {
 			memcpy_t memcpy;
 			memset_t memset;
+			map_op_t map_op;
 		};
 	} fn;
 };
@@ -160,6 +164,14 @@ static union bench_clock clock_diff(union bench_clock *s, union bench_clock *e)
 	return t;
 }
 
+static void clock_accum(union bench_clock *a, union bench_clock *b)
+{
+	if (use_cycles)
+		a->cycles += b->cycles;
+	else
+		timeradd(&a->tv, &b->tv, &a->tv);
+}
+
 static double timeval2double(struct timeval *ts)
 {
 	return (double)ts->tv_sec + (double)ts->tv_usec / (double)USEC_PER_SEC;
@@ -270,6 +282,8 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info *
 	}
 	p.page_shift = ilog2(page_size);
 
+	p.seed = seed;
+
 	if (!strncmp(function_str, "all", 3)) {
 		for (i = 0; info->functions[i].name; i++)
 			__bench_mem_function(info, &p, i);
@@ -464,3 +478,85 @@ int bench_mem_memset(int argc, const char **argv)
 
 	return bench_mem_common(argc, argv, &info);
 }
+
+static void map_page_touch(void *dst, size_t size, unsigned int page_shift, bool random)
+{
+	unsigned long npages = size / (1 << page_shift);
+	unsigned long offset = 0, r = 0;
+
+	for (unsigned long i = 0; i < npages; i++) {
+		if (random)
+			r = rand() % (1 << page_shift);
+
+		*((char *)dst + offset + r) = *(char *)(dst + offset + r) + i;
+		offset += 1 << page_shift;
+	}
+}
+
+static int do_map(const struct function *r, struct bench_params *p,
+		  void *src __maybe_unused, void *dst __maybe_unused,
+		  union bench_clock *accum)
+{
+	union bench_clock start, end, diff;
+	map_op_t fn = r->fn.map_op;
+	bool populate = strcmp(r->name, "populate") == 0;
+
+	if (p->seed)
+		srand(p->seed);
+
+	for (unsigned int i = 0; i < p->nr_loops; i++) {
+		clock_get(&start);
+		dst = bench_mmap(p->size, populate, p->page_shift);
+		if (!dst)
+			goto out;
+
+		fn(dst, p->size, p->page_shift, p->seed);
+		clock_get(&end);
+		diff = clock_diff(&start, &end);
+		clock_accum(accum, &diff);
+
+		bench_munmap(dst, p->size);
+	}
+
+	return 0;
+out:
+	printf("# Memory allocation failed - maybe size (%s) %s?\n", size_str,
+			p->page_shift != PAGE_SHIFT_4KB ? "has insufficient hugepages" : "is too large");
+	return -1;
+}
+
+static const char * const bench_mem_map_usage[] = {
+	"perf bench mem map <options>",
+	NULL
+};
+
+static const struct function map_functions[] = {
+	{ .name		= "populate",
+	  .desc		= "Eagerly populated map",
+	  .fn.map_op	= map_page_touch },
+
+	{ .name		= "demand",
+	  .desc		= "Demand loaded map",
+	  .fn.map_op	= map_page_touch },
+
+	{ .name = NULL, }
+};
+
+int bench_mem_map(int argc, const char **argv)
+{
+	static const struct option bench_map_options[] = {
+		OPT_UINTEGER('r', "randomize", &seed,
+			    "Seed to randomize page RW offset with."),
+		OPT_PARENT(bench_common_options),
+		OPT_END()
+	};
+
+	struct bench_mem_info info = {
+		.functions		= map_functions,
+		.do_op			= do_map,
+		.usage			= bench_mem_map_usage,
+		.options		= bench_map_options,
+	};
+
+	return bench_mem_common(argc, argv, &info);
+}
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index 2c1a9f3d847a..a20bd9882f0a 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -65,6 +65,7 @@ static struct bench mem_benchmarks[] = {
 	{ "memcpy",	"Benchmark for memcpy() functions",		bench_mem_memcpy	},
 	{ "memset",	"Benchmark for memset() functions",		bench_mem_memset	},
 	{ "find_bit",	"Benchmark for find_bit() functions",		bench_mem_find_bit	},
+	{ "map",	"Benchmark for mmap() mappings",		bench_mem_map		},
 	{ "all",	"Run all memory access benchmarks",		NULL			},
 	{ NULL,		NULL,						NULL			}
 };
-- 
2.31.1




* [PATCH v4 10/13] x86/mm: Simplify clear_page_*
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
                   ` (8 preceding siblings ...)
  2025-06-16  5:22 ` [PATCH v4 09/13] perf bench mem: Add mmap() workloads Ankur Arora
@ 2025-06-16  5:22 ` Ankur Arora
  2025-06-16 14:35   ` Dave Hansen
  2025-06-16 16:48   ` kernel test robot
  2025-06-16  5:22 ` [PATCH v4 11/13] x86/clear_page: Introduce clear_pages() Ankur Arora
                   ` (4 subsequent siblings)
  14 siblings, 2 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
variations. Inlining gets rid of the call/ret, which is costly
especially in configurations with speculative-execution mitigations.

Also, fix up and rename clear_page_orig() to adapt it to the changed
calling convention.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page_64.h | 19 ++++++++---------
 arch/x86/lib/clear_page_64.S   | 39 +++++++---------------------------
 2 files changed, 17 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 015d23f3e01f..596333bd0c73 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -40,23 +40,22 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 
 #define __phys_reloc_hide(x)	(x)
 
-void clear_page_orig(void *page);
-void clear_page_rep(void *page);
-void clear_page_erms(void *page);
+void memzero_page_aligned_unrolled(void *addr, u64 len);
 
 static inline void clear_page(void *page)
 {
+	u64 len = PAGE_SIZE;
 	/*
 	 * Clean up KMSAN metadata for the page being cleared. The assembly call
 	 * below clobbers @page, so we perform unpoisoning before it.
 	 */
-	kmsan_unpoison_memory(page, PAGE_SIZE);
-	alternative_call_2(clear_page_orig,
-			   clear_page_rep, X86_FEATURE_REP_GOOD,
-			   clear_page_erms, X86_FEATURE_ERMS,
-			   "=D" (page),
-			   "D" (page),
-			   "cc", "memory", "rax", "rcx");
+	kmsan_unpoison_memory(page, len);
+	asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
+				   "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
+				   "rep stosb", X86_FEATURE_ERMS)
+			: "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
+			: "a" (0)
+			: "cc", "memory");
 }
 
 void copy_page(void *to, void *from);
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index a508e4a8c66a..27debe0c018c 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -6,30 +6,15 @@
 #include <asm/asm.h>
 
 /*
- * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
- * recommended to use this when possible and we do use them by default.
- * If enhanced REP MOVSB/STOSB is not available, try to use fast string.
- * Otherwise, use original.
+ * Zero page aligned region.
+ * %rdi	- dest
+ * %rcx	- length
  */
-
-/*
- * Zero a page.
- * %rdi	- page
- */
-SYM_TYPED_FUNC_START(clear_page_rep)
-	movl $4096/8,%ecx
-	xorl %eax,%eax
-	rep stosq
-	RET
-SYM_FUNC_END(clear_page_rep)
-EXPORT_SYMBOL_GPL(clear_page_rep)
-
-SYM_TYPED_FUNC_START(clear_page_orig)
-	xorl   %eax,%eax
-	movl   $4096/64,%ecx
+SYM_TYPED_FUNC_START(memzero_page_aligned_unrolled)
+	shrq   $6, %rcx
 	.p2align 4
 .Lloop:
-	decl	%ecx
+	decq	%rcx
 #define PUT(x) movq %rax,x*8(%rdi)
 	movq %rax,(%rdi)
 	PUT(1)
@@ -43,16 +28,8 @@ SYM_TYPED_FUNC_START(clear_page_orig)
 	jnz	.Lloop
 	nop
 	RET
-SYM_FUNC_END(clear_page_orig)
-EXPORT_SYMBOL_GPL(clear_page_orig)
-
-SYM_TYPED_FUNC_START(clear_page_erms)
-	movl $4096,%ecx
-	xorl %eax,%eax
-	rep stosb
-	RET
-SYM_FUNC_END(clear_page_erms)
-EXPORT_SYMBOL_GPL(clear_page_erms)
+SYM_FUNC_END(memzero_page_aligned_unrolled)
+EXPORT_SYMBOL_GPL(memzero_page_aligned_unrolled)
 
 /*
  * Default clear user-space.
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 11/13] x86/clear_page: Introduce clear_pages()
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
                   ` (9 preceding siblings ...)
  2025-06-16  5:22 ` [PATCH v4 10/13] x86/mm: Simplify clear_page_* Ankur Arora
@ 2025-06-16  5:22 ` Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 12/13] mm: memory: allow arch override for folio_zero_user() Ankur Arora
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

Performance when clearing with string instructions (x86-64-stosq and
similar) can vary significantly based on the chunk-size used.

  $ perf bench mem memset -k 4KB -s 4GB -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4GB bytes ...

      13.748208 GB/sec

  $ perf bench mem memset -k 2MB -s 4GB -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in
  # arch/x86/lib/memset_64.S)
  # Copying 4GB bytes ...

      15.067900 GB/sec

  $ perf bench mem memset -k 1GB -s 4GB -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4GB bytes ...

      38.104311 GB/sec

(All runs on AMD Milan.)

Going from a chunk-size of 4KB to 1GB, performance rises from 13.7 GB/sec
to 38.1 GB/sec. For a chunk-size of 2MB the change isn't quite as
drastic, but it is still worth adding a multi-page variant.
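
As a usage sketch (a hypothetical caller; clear_pages() itself is
introduced in the hunks below), a 2MB extent becomes a single call
rather than 512 page-at-a-time calls:

	/* Hypothetical: zero one 2MB hugepage as a single extent. */
	void *addr = page_address(folio_page(folio, 0));

	clear_pages(addr, 1 << (PMD_SHIFT - PAGE_SHIFT));	/* 512 x 4KB */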

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page_32.h | 18 ++++++++++++++++--
 arch/x86/include/asm/page_64.h | 25 +++++++++++++++++++------
 2 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index 0c623706cb7e..66e84b4b8a0f 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -17,9 +17,23 @@ extern unsigned long __phys_addr(unsigned long);
 
 #include <linux/string.h>
 
-static inline void clear_page(void *page)
+/*
+ * clear_pages() - clear kernel page range.
+ * @addr: page aligned pointer
+ * @npages: number of pages
+ *
+ * Assumes that (@addr, +@npages) references a kernel region.
+ * Does absolutely no exception handling.
+ */
+static inline void clear_pages(void *addr, u64 npages)
 {
-	memset(page, 0, PAGE_SIZE);
+	for (u64 i = 0; i < npages; i++)
+		memset(addr + i * PAGE_SIZE, 0, PAGE_SIZE);
+}
+
+static inline void clear_page(void *addr)
+{
+	clear_pages(addr, 1);
 }
 
 static inline void copy_page(void *to, void *from)
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 596333bd0c73..1b8be71fd45c 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -42,22 +42,35 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 
 void memzero_page_aligned_unrolled(void *addr, u64 len);
 
-static inline void clear_page(void *page)
+/*
+ * clear_pages() - clear kernel page range.
+ * @addr: page aligned pointer
+ * @npages: number of pages
+ *
+ * Assumes that (@addr, +@npages) references a kernel region.
+ * Does absolutely no exception handling.
+ */
+static inline void clear_pages(void *addr, u64 npages)
 {
-	u64 len = PAGE_SIZE;
+	u64 len = npages * PAGE_SIZE;
 	/*
-	 * Clean up KMSAN metadata for the page being cleared. The assembly call
-	 * below clobbers @page, so we perform unpoisoning before it.
+	 * Clean up KMSAN metadata for the pages being cleared. The assembly call
+	 * below clobbers @addr, so we perform unpoisoning before it.
 	 */
-	kmsan_unpoison_memory(page, len);
+	kmsan_unpoison_memory(addr, len);
 	asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
 				   "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
 				   "rep stosb", X86_FEATURE_ERMS)
-			: "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
+			: "+c" (len), "+D" (addr), ASM_CALL_CONSTRAINT
 			: "a" (0)
 			: "cc", "memory");
 }
 
+static inline void clear_page(void *addr)
+{
+	clear_pages(addr, 1);
+}
+
 void copy_page(void *to, void *from);
 KCFI_REFERENCE(copy_page);
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 12/13] mm: memory: allow arch override for folio_zero_user()
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
                   ` (10 preceding siblings ...)
  2025-06-16  5:22 ` [PATCH v4 11/13] x86/clear_page: Introduce clear_pages() Ankur Arora
@ 2025-06-16  5:22 ` Ankur Arora
  2025-06-16  5:22 ` [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing Ankur Arora
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

folio_zero_user() is constrained to operate in a page-at-a-time
fashion because it handles CONFIG_HIGHMEM, where the pages in a folio
might be mapped to a discontiguous kernel address range.

In addition, cooperative preemption models (none, voluntary) force
zeroing of successive chunks to be interspersed with invocations of
cond_resched().

Allow an architecture-specific override.
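
The override relies on ordinary strong-vs-weak symbol resolution. A
minimal sketch of the mechanism (bodies elided; not the actual
implementations):

	/* mm/memory.c: generic, page-at-a-time default. */
	void __weak folio_zero_user(struct folio *folio, unsigned long addr_hint)
	{
		/* ... per-page clearing loop ... */
	}

	/* arch/x86/mm/memory.c: a strong definition here wins at link
	 * time; no Kconfig knob or function pointer is needed. */
	void folio_zero_user(struct folio *folio, unsigned long addr_hint)
	{
		/* ... clear_pages()-based version (next patch) ... */
	}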

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8eba595056fe..e769480b712a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7079,8 +7079,11 @@ static int clear_subpage(unsigned long addr, int idx, void *arg)
  * folio_zero_user - Zero a folio which will be mapped to userspace.
  * @folio: The folio to zero.
 * @addr_hint: The address that will be accessed, or the base address if unclear.
+ *
+ * folio_zero_user() does page-at-a-time zeroing because it needs to handle
+ * CONFIG_HIGHMEM. Allow architecture override.
  */
-void folio_zero_user(struct folio *folio, unsigned long addr_hint)
+void __weak folio_zero_user(struct folio *folio, unsigned long addr_hint)
 {
 	unsigned int nr_pages = folio_nr_pages(folio);
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
                   ` (11 preceding siblings ...)
  2025-06-16  5:22 ` [PATCH v4 12/13] mm: memory: allow arch override for folio_zero_user() Ankur Arora
@ 2025-06-16  5:22 ` Ankur Arora
  2025-06-16 11:39   ` kernel test robot
  2025-06-16 14:44   ` Dave Hansen
  2025-06-16 15:06 ` [PATCH v4 00/13] x86/mm: " Dave Hansen
  2025-07-04  8:15 ` Raghavendra K T
  14 siblings, 2 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16  5:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

Override the common code version of folio_zero_user() so we can use
clear_pages() to do multi-page clearing instead of the standard
page-at-a-time clearing. This allows us to advertise the full
region-size to the processor, which, when using string instructions
(REP; STOS), can use knowledge of the extent to optimize the clearing.

Apart from this we have two other considerations: cache locality when
clearing 2MB pages, and preemption latency when clearing GB pages.

The first is handled by breaking the clearing into three parts: the
faulting page and its immediate locality, and the regions to its left
and right; the local neighbourhood is cleared last.
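
Schematically, for a fault somewhere in the middle of the page (the
range names match r[0..2] in the code below):

	folio: |<-- r[1]: left -->|<- r[2]: fault +/- 2 pages ->|<-- r[0]: right -->|
	order:     cleared 2nd            cleared last               cleared 1st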

The second is only an issue for kernels running under cooperative
preemption. Limit the worst case preemption latency by clearing in
PAGE_RESCHED_CHUNK (8MB) units.

The resultant performance depends on which optimizations the uarch can
apply to the clearing extent. Two classes of optimizations:

  - amortizing each clearing iteration over a large range instead of
    at page granularity
  - cacheline allocation elision (seen only on AMD Zen models)

A demand fault workload shows that the resultant performance falls in
two buckets depending on whether the extent being zeroed is large
enough to allow cacheline allocation elision.

AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):

 $ perf bench mem map -p $page-size -f demand -s 64GB -l 5

                 mm/folio_zero_user    x86/folio_zero_user       change
                  (GB/s  +- %stdev)     (GB/s  +- %stdev)

  pg-sz=2MB       11.82  +- 0.67%        16.48  +-  0.30%       + 39.4%
  pg-sz=1GB       17.51  +- 1.19%        40.03  +-  7.26% [#]   +129.9%

[#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, which is higher than PAGE_RESCHED_CHUNK, so
preempt=none|voluntary sees no improvement for this test.

  pg-sz=1GB       17.14  +- 1.39%        17.42  +-  0.98%        + 1.6%

The dropoff in cacheline allocations for pg-sz=1GB can be seen with
perf-stat:

   - 44,513,459,667      cycles                           #    2.420 GHz                         ( +-  0.44% )  (35.71%)
   -  1,378,032,592      instructions                     #    0.03  insn per cycle
   - 11,224,288,082      L1-dcache-loads                  #  610.187 M/sec                       ( +-  0.08% )  (35.72%)
   -  5,373,473,118      L1-dcache-load-misses            #   47.87% of all L1-dcache accesses   ( +-  0.00% )  (35.71%)

   + 20,093,219,076      cycles                           #    2.421 GHz                         ( +-  3.64% )  (35.69%)
   +  1,378,032,592      instructions                     #    0.03  insn per cycle
   +    186,525,095      L1-dcache-loads                  #   22.479 M/sec                       ( +-  2.11% )  (35.74%)
   +     73,479,687      L1-dcache-load-misses            #   39.39% of all L1-dcache accesses   ( +-  3.03% )  (35.74%)

Also note that as mentioned earlier, this improvement is not specific to
AMD Zen*. Intel Icelakex (pg-sz=2MB|1GB) sees a similar improvement as
the Milan pg-sz=2MB workload above (~35%).

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/mm/Makefile |  1 +
 arch/x86/mm/memory.c | 97 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 98 insertions(+)
 create mode 100644 arch/x86/mm/memory.c

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5b9908f13dcf..9031faf21849 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_MMIOTRACE_TEST)	+= testmmiotrace.o
 obj-$(CONFIG_NUMA)		+= numa.o
 obj-$(CONFIG_AMD_NUMA)		+= amdtopology.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat.o
+obj-$(CONFIG_PREEMPTION)	+= memory.o
 
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c
new file mode 100644
index 000000000000..a799c0cc3c5f
--- /dev/null
+++ b/arch/x86/mm/memory.c
@@ -0,0 +1,97 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/mm.h>
+#include <linux/range.h>
+#include <linux/minmax.h>
+
+/*
+ * Limit the optimized version of folio_zero_user() to !CONFIG_HIGHMEM.
+ * We do that because clear_pages() works on contiguous kernel pages
+ * which might not be true under HIGHMEM.
+ */
+#ifndef CONFIG_HIGHMEM
+/*
+ * For voluntary preemption models, operate with a max chunk-size of 8MB.
+ * (Worst case resched latency of ~1ms, with a clearing BW of ~10GBps.)
+ */
+#define PAGE_RESCHED_CHUNK	(8 << (20 - PAGE_SHIFT))
+
+static void clear_pages_resched(void *addr, int npages)
+{
+	int i, remaining;
+
+	if (preempt_model_preemptible()) {
+		clear_pages(addr, npages);
+		goto out;
+	}
+
+	for (i = 0; i < npages/PAGE_RESCHED_CHUNK; i++) {
+		clear_pages(addr + i * PAGE_RESCHED_CHUNK * PAGE_SIZE, PAGE_RESCHED_CHUNK);
+		cond_resched();
+	}
+
+	remaining = npages % PAGE_RESCHED_CHUNK;
+
+	if (remaining)
+		clear_pages(addr + i * PAGE_RESCHED_CHUNK * PAGE_SIZE, remaining);
+out:
+	cond_resched();
+}
+
+/*
+ * folio_zero_user() - multi-page clearing.
+ *
+ * @folio: hugepage folio
+ * @addr_hint: faulting address (if any)
+ *
+ * Overrides common code folio_zero_user(). This version takes advantage of
+ * the fact that string instructions in clear_pages() are more performant
+ * on larger extents compared to the usual page-at-a-time clearing.
+ *
+ * Clearing of 2MB pages is split into three parts: pages in the immediate
+ * locality of the faulting page, and its left, right regions; with the local
+ * neighbourhood cleared last in order to keep cache lines of the target
+ * region hot.
+ *
+ * For GB pages, there is no expectation of cache locality so just do a
+ * straight zero.
+ *
+ * Note that the folio is fully allocated already so we don't do any exception
+ * handling.
+ */
+void folio_zero_user(struct folio *folio, unsigned long addr_hint)
+{
+	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
+	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
+	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
+	const int width = 2; /* number of pages cleared last on either side */
+	struct range r[3];
+	int i;
+
+	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
+		clear_pages_resched(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
+		return;
+	}
+
+	/*
+	 * Faulting page and its immediate neighbourhood. Cleared at the end to
+	 * ensure it sticks around in the cache.
+	 */
+	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+	/* Region to the left of the fault */
+	r[1] = DEFINE_RANGE(pg.start,
+			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+			    pg.end);
+
+	for (i = 0; i <= 2; i++) {
+		int npages = range_len(&r[i]);
+
+		if (npages > 0)
+			clear_pages_resched(page_address(folio_page(folio, r[i].start)), npages);
+	}
+}
+#endif /* CONFIG_HIGHMEM */
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing
  2025-06-16  5:22 ` [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing Ankur Arora
@ 2025-06-16 11:39   ` kernel test robot
  2025-06-16 14:44   ` Dave Hansen
  1 sibling, 0 replies; 32+ messages in thread
From: kernel test robot @ 2025-06-16 11:39 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: llvm, oe-kbuild-all, akpm, bp, dave.hansen, hpa, mingo, mjguzik,
	luto, acme, namhyung, tglx, willy, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, ankur.a.arora

Hi Ankur,

kernel test robot noticed the following build warnings:

[auto build test WARNING on perf-tools-next/perf-tools-next]
[also build test WARNING on tip/perf/core perf-tools/perf-tools linus/master v6.16-rc2 next-20250616]
[cannot apply to acme/perf/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ankur-Arora/perf-bench-mem-Remove-repetition-around-time-measurement/20250616-132651
base:   https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git perf-tools-next
patch link:    https://lore.kernel.org/r/20250616052223.723982-14-ankur.a.arora%40oracle.com
patch subject: [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing
config: x86_64-buildonly-randconfig-003-20250616 (https://download.01.org/0day-ci/archive/20250616/202506161939.YrEAfTPY-lkp@intel.com/config)
compiler: clang version 20.1.2 (https://github.com/llvm/llvm-project 58df0ef89dd64126512e4ee27b4ac3fd8ddf6247)
rustc: rustc 1.78.0 (9b00956e5 2024-04-29)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250616/202506161939.YrEAfTPY-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506161939.YrEAfTPY-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> arch/x86/mm/memory.c:61:6: warning: no previous prototype for function 'folio_zero_user' [-Wmissing-prototypes]
      61 | void folio_zero_user(struct folio *folio, unsigned long addr_hint)
         |      ^
   arch/x86/mm/memory.c:61:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
      61 | void folio_zero_user(struct folio *folio, unsigned long addr_hint)
         | ^
         | static 
   1 warning generated.


vim +/folio_zero_user +61 arch/x86/mm/memory.c

    39	
    40	/*
    41	 * folio_zero_user() - multi-page clearing.
    42	 *
    43	 * @folio: hugepage folio
    44	 * @addr_hint: faulting address (if any)
    45	 *
    46	 * Overrides common code folio_zero_user(). This version takes advantage of
    47	 * the fact that string instructions in clear_pages() are more performant
    48	 * on larger extents compared to the usual page-at-a-time clearing.
    49	 *
    50	 * Clearing of 2MB pages is split into three parts: pages in the immediate
    51	 * locality of the faulting page, and its left, right regions; with the local
    52	 * neighbourhood cleared last in order to keep cache lines of the target
    53	 * region hot.
    54	 *
    55	 * For GB pages, there is no expectation of cache locality so just do a
    56	 * straight zero.
    57	 *
    58	 * Note that the folio is fully allocated already so we don't do any exception
    59	 * handling.
    60	 */
  > 61	void folio_zero_user(struct folio *folio, unsigned long addr_hint)

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 10/13] x86/mm: Simplify clear_page_*
  2025-06-16  5:22 ` [PATCH v4 10/13] x86/mm: Simplify clear_page_* Ankur Arora
@ 2025-06-16 14:35   ` Dave Hansen
  2025-06-16 14:38     ` Peter Zijlstra
  2025-06-16 18:18     ` Ankur Arora
  2025-06-16 16:48   ` kernel test robot
  1 sibling, 2 replies; 32+ messages in thread
From: Dave Hansen @ 2025-06-16 14:35 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk

On 6/15/25 22:22, Ankur Arora wrote:
> clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
> variations. Inlining gets rid of the costly call/ret (for cases with
> speculative execution related mitigations.)

Could you elaborate a bit on which "speculative execution related
mitigations" are so costly with these direct calls?


> -	kmsan_unpoison_memory(page, PAGE_SIZE);
> -	alternative_call_2(clear_page_orig,
> -			   clear_page_rep, X86_FEATURE_REP_GOOD,
> -			   clear_page_erms, X86_FEATURE_ERMS,
> -			   "=D" (page),
> -			   "D" (page),
> -			   "cc", "memory", "rax", "rcx");

I've got to say, I don't dislike the old code. It's utterly clear from
that code what's going on. It's arguable that it's not clear that the
rep/erms variants are just using stosb vs. stosq, but the high level
concept of "use a feature flag to switch between three implementations
of clear page" is crystal clear.

> +	kmsan_unpoison_memory(page, len);
> +	asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
> +				   "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
> +				   "rep stosb", X86_FEATURE_ERMS)
> +			: "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
> +			: "a" (0)
> +			: "cc", "memory");
>  }

This is substantially less clear. It also doesn't even add comments to
make up for the decreased clarity.

>  void copy_page(void *to, void *from);
> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index a508e4a8c66a..27debe0c018c 100644
> --- a/arch/x86/lib/clear_page_64.S
> +++ b/arch/x86/lib/clear_page_64.S
> @@ -6,30 +6,15 @@
>  #include <asm/asm.h>
>  
>  /*
> - * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
> - * recommended to use this when possible and we do use them by default.
> - * If enhanced REP MOVSB/STOSB is not available, try to use fast string.
> - * Otherwise, use original.
> + * Zero page aligned region.
> + * %rdi	- dest
> + * %rcx	- length
>   */

That comment was pretty useful, IMNHO.

How about we add something like this above it? I think it explains the
whole landscape, including the fact that X86_FEATURE_REP_GOOD is
synthetic and X86_FEATURE_ERMS is not:

Switch between three implementations of page clearing based on CPU
capabilities:

 1. memzero_page_aligned_unrolled(): the oldest, slowest and universally
    supported method. Uses a for loop (in assembly) to write a 64-byte
    cacheline on each loop. Each loop iteration writes to memory using
    8x 8-byte MOV instructions.
 2. "rep stosq": Really old CPUs had crummy REP implementations.
    Vendor CPU setup code sets 'REP_GOOD' on CPUs where REP can be
    trusted. The instruction writes 8 bytes per REP iteration but CPUs
    internally batch these together and do larger writes.
 3. "rep stosb": CPUs that enumerate 'ERMS' have an improved STOS
    implementation that is less picky about alignment and where STOSB
    (1 byte at a time) is actually faster than STOSQ (8 bytes at a
    time).




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 10/13] x86/mm: Simplify clear_page_*
  2025-06-16 14:35   ` Dave Hansen
@ 2025-06-16 14:38     ` Peter Zijlstra
  2025-06-16 18:18     ` Ankur Arora
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2025-06-16 14:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, acme, namhyung, tglx, willy, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk

On Mon, Jun 16, 2025 at 07:35:48AM -0700, Dave Hansen wrote:
> On 6/15/25 22:22, Ankur Arora wrote:
> > clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
> > variations. Inlining gets rid of the costly call/ret (for cases with
> > speculative execution related mitigations.)
> 
> Could you elaborate a bit on which "speculative execution related
> mitigations" are so costly with these direct calls?

Pretty much everything with RETHUNK set I would imagine.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing
  2025-06-16  5:22 ` [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing Ankur Arora
  2025-06-16 11:39   ` kernel test robot
@ 2025-06-16 14:44   ` Dave Hansen
  2025-06-16 14:50     ` Peter Zijlstra
                       ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Dave Hansen @ 2025-06-16 14:44 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk

On 6/15/25 22:22, Ankur Arora wrote:
> Override the common code version of folio_zero_user() so we can use
> clear_pages() to do multi-page clearing instead of the standard
> page-at-a-time clearing.

I'm not a big fan of the naming in this series.

To me multi-page means "more than one 'struct page'". But this series is
clearly using multi-page clearing to mean clearing >PAGE_SIZE in one
clear. But oh well.

The second problem with where this ends up is that none of the code is
*actually* x86-specific. The only thing that x86 provides that's
interesting is a clear_pages() implementation that hands >PAGE_SIZE
units down to the CPUs.

The result is ~100 lines of code that will compile and run functionally
on any architecture.

To me, that's deserving of an ARCH_HAS_FOO bit that we can set on the
x86 side that then cajoles the core mm/ code to use the fancy new
clear_pages_resched() implementation.

Because what are the arm64 guys going to do when their CPUs start doing
this? They're either going to copy-and-paste the x86 implementation or
they're going to refactor the x86 implementation into common code.

My money is on the refactoring, because those arm64 guys do good work.
Could we save them the trouble, please?

Oh, and one other little thing:

> +/*
> + * Limit the optimized version of folio_zero_user() to !CONFIG_HIGHMEM.
> + * We do that because clear_pages() works on contiguous kernel pages
> + * which might not be true under HIGHMEM.
> + */

The tip trees are picky about imperative voice, so no "we's". But if you
stick this in mm/, folks are less picky. ;)


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing
  2025-06-16 14:44   ` Dave Hansen
@ 2025-06-16 14:50     ` Peter Zijlstra
  2025-06-16 15:03       ` Dave Hansen
  2025-06-16 18:20       ` Ankur Arora
  2025-06-16 14:58     ` Matthew Wilcox
  2025-06-16 18:47     ` Ankur Arora
  2 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2025-06-16 14:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, acme, namhyung, tglx, willy, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk

On Mon, Jun 16, 2025 at 07:44:13AM -0700, Dave Hansen wrote:

> To me, that's deserving of an ARCH_HAS_FOO bit that we can set on the
> x86 side that then cajoles the core mm/ code to use the fancy new
> clear_pages_resched() implementation.

Note that we should only set this bit with either full or lazy
preemption selected. Haven't checked the patch-set to see if that
constraint is already taken care of.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing
  2025-06-16 14:44   ` Dave Hansen
  2025-06-16 14:50     ` Peter Zijlstra
@ 2025-06-16 14:58     ` Matthew Wilcox
  2025-06-16 18:47     ` Ankur Arora
  2 siblings, 0 replies; 32+ messages in thread
From: Matthew Wilcox @ 2025-06-16 14:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk

On Mon, Jun 16, 2025 at 07:44:13AM -0700, Dave Hansen wrote:
> To me multi-page means "more than one 'struct page'". But this series is
> clearly using multi-page clearing to mean clearing >PAGE_SIZE in one
> clear. But oh well.

I'm not sure I see the distinction you're trying to draw.  struct page
refers to a PAGE_SIZE aligned, PAGE_SIZE sized chunk of memory.  So
if you do something to more than PAGE_SIZE bytes, you're doing something
to multiple struct pages.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing
  2025-06-16 14:50     ` Peter Zijlstra
@ 2025-06-16 15:03       ` Dave Hansen
  2025-06-16 18:20       ` Ankur Arora
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Hansen @ 2025-06-16 15:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, acme, namhyung, tglx, willy, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk

On 6/16/25 07:50, Peter Zijlstra wrote:
> On Mon, Jun 16, 2025 at 07:44:13AM -0700, Dave Hansen wrote:
>> To me, that's deserving of an ARCH_HAS_FOO bit that we can set on the
>> x86 side that then cajoles the core mm/ code to use the fancy new
>> clear_pages_resched() implementation.
> Note that we should only set this bit with either full or lazy
> preemption selected. Haven't checked the patch-set to see if that
> constraint is already taken care of.

There is a check in the C code for preempt_model_preemptible(). So as
long as there was something like:

config SOMETHING_PAGE_CLEARING
	def_bool y
	depends on ARCH_HAS_WHATEVER_PAGE_CLEARING
	depends on !HIGHMEM

Then the check for HIGHMEM and the specific architecture support could
be along the lines of:

static void clear_pages_resched(void *addr, int npages)
{
	int i, remaining;

	if (!IS_ENABLED(CONFIG_SOMETHING_PAGE_CLEARING) ||
	    preempt_model_preemptible()) {
		clear_pages(addr, npages);
		goto out;
	}

... which would also remove the #ifdef CONFIG_HIGHMEM in there now.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/13] x86/mm: Add multi-page clearing
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
                   ` (12 preceding siblings ...)
  2025-06-16  5:22 ` [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing Ankur Arora
@ 2025-06-16 15:06 ` Dave Hansen
  2025-06-16 18:25   ` Ankur Arora
  2025-07-04  8:15 ` Raghavendra K T
  14 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2025-06-16 15:06 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk

On 6/15/25 22:22, Ankur Arora wrote:
> This series adds multi-page clearing for hugepages, improving on the
> current page-at-a-time approach in two ways:
> 
>  - amortizes the per-page setup cost over a larger extent
>  - when using string instructions, exposes the real region size to the
>    processor. A processor could use that as a hint to optimize based
>    on the full extent size. AMD Zen uarchs, as an example, elide
>    allocation of cachelines for regions larger than L3-size.

Have you happened to do any testing outside of 'perf bench'?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 10/13] x86/mm: Simplify clear_page_*
  2025-06-16  5:22 ` [PATCH v4 10/13] x86/mm: Simplify clear_page_* Ankur Arora
  2025-06-16 14:35   ` Dave Hansen
@ 2025-06-16 16:48   ` kernel test robot
  1 sibling, 0 replies; 32+ messages in thread
From: kernel test robot @ 2025-06-16 16:48 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: llvm, oe-kbuild-all, akpm, bp, dave.hansen, hpa, mingo, mjguzik,
	luto, peterz, acme, namhyung, tglx, willy, jon.grimm, bharata,
	raghavendra.kt, boris.ostrovsky, konrad.wilk, ankur.a.arora

Hi Ankur,

kernel test robot noticed the following build errors:

[auto build test ERROR on perf-tools-next/perf-tools-next]
[also build test ERROR on tip/perf/core perf-tools/perf-tools linus/master v6.16-rc2 next-20250616]
[cannot apply to acme/perf/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ankur-Arora/perf-bench-mem-Remove-repetition-around-time-measurement/20250616-132651
base:   https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git perf-tools-next
patch link:    https://lore.kernel.org/r/20250616052223.723982-11-ankur.a.arora%40oracle.com
patch subject: [PATCH v4 10/13] x86/mm: Simplify clear_page_*
config: x86_64-buildonly-randconfig-004-20250616 (https://download.01.org/0day-ci/archive/20250617/202506170010.r7INoexI-lkp@intel.com/config)
compiler: clang version 20.1.2 (https://github.com/llvm/llvm-project 58df0ef89dd64126512e4ee27b4ac3fd8ddf6247)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250617/202506170010.r7INoexI-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506170010.r7INoexI-lkp@intel.com/

All errors (new ones prefixed by >>):

>> ld.lld: error: undefined symbol: __kcfi_typeid_memzero_page_aligned_unrolled
   >>> referenced by usercopy_64.c
   >>>               vmlinux.o:(__cfi_memzero_page_aligned_unrolled)

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 10/13] x86/mm: Simplify clear_page_*
  2025-06-16 14:35   ` Dave Hansen
  2025-06-16 14:38     ` Peter Zijlstra
@ 2025-06-16 18:18     ` Ankur Arora
  1 sibling, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16 18:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk


Dave Hansen <dave.hansen@intel.com> writes:

> On 6/15/25 22:22, Ankur Arora wrote:
>> clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
>> variations. Inlining gets rid of the costly call/ret (for cases with
>> speculative execution related mitigations.)
>
> Could you elaborate a bit on which "speculative execution related
> mitigations" are so costly with these direct calls?

I can spell out that we would mispredict on the RET when RETHUNK is in use.

>> -	kmsan_unpoison_memory(page, PAGE_SIZE);
>> -	alternative_call_2(clear_page_orig,
>> -			   clear_page_rep, X86_FEATURE_REP_GOOD,
>> -			   clear_page_erms, X86_FEATURE_ERMS,
>> -			   "=D" (page),
>> -			   "D" (page),
>> -			   "cc", "memory", "rax", "rcx");
>
> I've got to say, I don't dislike the old code. It's utterly clear from
> that code what's going on. It's arguable that it's not clear that the
> rep/erms variants are just using stosb vs. stosq, but the high level
> concept of "use a feature flag to switch between three implementations
> of clear page" is crystal clear.
>
>> +	kmsan_unpoison_memory(page, len);
>> +	asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
>> +				   "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
>> +				   "rep stosb", X86_FEATURE_ERMS)
>> +			: "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
>> +			: "a" (0)
>> +			: "cc", "memory");
>>  }
>
> This is substantially less clear. It also doesn't even add comments to
> make up for the decreased clarity.
>
>>  void copy_page(void *to, void *from);
>> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
>> index a508e4a8c66a..27debe0c018c 100644
>> --- a/arch/x86/lib/clear_page_64.S
>> +++ b/arch/x86/lib/clear_page_64.S
>> @@ -6,30 +6,15 @@
>>  #include <asm/asm.h>
>>
>>  /*
>> - * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
>> - * recommended to use this when possible and we do use them by default.
>> - * If enhanced REP MOVSB/STOSB is not available, try to use fast string.
>> - * Otherwise, use original.
>> + * Zero page aligned region.
>> + * %rdi	- dest
>> + * %rcx	- length
>>   */
>
> That comment was pretty useful, IMNHO.
>
> How about we add something like this above it? I think it explains the
> whole landscape, including the fact that X86_FEATURE_REP_GOOD is
> synthetic and X86_FEATURE_ERMS is not:
>
> Switch between three implementations of page clearing based on CPU
> capabilities:
>
>  1. memzero_page_aligned_unrolled(): the oldest, slowest and universally
>     supported method. Uses a for loop (in assembly) to write a 64-byte
>     cacheline on each loop. Each loop iteration writes to memory using
>     8x 8-byte MOV instructions.
>  2. "rep stosq": Really old CPUs had crummy REP implementations.
>     Vendor CPU setup code sets 'REP_GOOD' on CPUs where REP can be
>     trusted. The instruction writes 8 bytes per REP iteration but CPUs
>     internally batch these together and do larger writes.
>  3. "rep stosb": CPUs that enumerate 'ERMS' have an improved STOS
>     implementation that is less picky about alignment and where STOSB
>     (1 byte at a time) is actually faster than STOSQ (8 bytes at a
>     time).

Yeah, this seems good to add. And sorry, I should have fleshed that
comment out in the new location instead of just getting rid of it.

--
ankur


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing
  2025-06-16 14:50     ` Peter Zijlstra
  2025-06-16 15:03       ` Dave Hansen
@ 2025-06-16 18:20       ` Ankur Arora
  1 sibling, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16 18:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp,
	dave.hansen, hpa, mingo, mjguzik, luto, acme, namhyung, tglx,
	willy, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk


Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Jun 16, 2025 at 07:44:13AM -0700, Dave Hansen wrote:
>
>> To me, that's deserving of an ARCH_HAS_FOO bit that we can set on the
>> x86 side that then cajoles the core mm/ code to use the fancy new
>> clear_pages_resched() implementation.
>
> Note that we should only set this bit with either full or lazy
> preemption selected. Haven't checked the patch-set to see if that
> constraint is already taken care of.

It is. I check for preempt_model_preemptible() and limit voluntary models
to chunk sizes of 8MB.

--
ankur


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/13] x86/mm: Add multi-page clearing
  2025-06-16 15:06 ` [PATCH v4 00/13] x86/mm: " Dave Hansen
@ 2025-06-16 18:25   ` Ankur Arora
  2025-06-16 18:30     ` Dave Hansen
  0 siblings, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2025-06-16 18:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk


Dave Hansen <dave.hansen@intel.com> writes:

> On 6/15/25 22:22, Ankur Arora wrote:
>> This series adds multi-page clearing for hugepages, improving on the
>> current page-at-a-time approach in two ways:
>>
>>  - amortizes the per-page setup cost over a larger extent
>>  - when using string instructions, exposes the real region size to the
>>    processor. A processor could use that as a hint to optimize based
>>    on the full extent size. AMD Zen uarchs, as an example, elide
>>    allocation of cachelines for regions larger than L3-size.
>
> Have you happened to do any testing outside of 'perf bench'?

Yeah. My original tests were with qemu creating a pinned guest (where it
would go and touch pages after allocation).

I think perf bench is a reasonably good test because a lot of demand
faulting often just boils down to the same kind of loop. And of course
MAP_POPULATE is essentially equal to the clearing loop in the kernel.
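
For instance, the populate variant of the new workload drives that same
loop:

	$ perf bench mem map -p 2MB -f populate -s 64GB -l 5

while -f demand drives it through the fault path instead.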

I'm happy to try other tests if you have some in mind.

And, thanks for the quick comments!

--
ankur


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/13] x86/mm: Add multi-page clearing
  2025-06-16 18:25   ` Ankur Arora
@ 2025-06-16 18:30     ` Dave Hansen
  2025-06-16 18:43       ` Ankur Arora
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2025-06-16 18:30 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, acme, namhyung, tglx, willy, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk

On 6/16/25 11:25, Ankur Arora wrote:
> I'm happy to try other tests if you have some in mind.

I'd just want to make sure that the normal 4k clear_page() users aren't
seeing anything weird.

A good old kernel compile would be fine.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/13] x86/mm: Add multi-page clearing
  2025-06-16 18:30     ` Dave Hansen
@ 2025-06-16 18:43       ` Ankur Arora
  0 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-16 18:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk


Dave Hansen <dave.hansen@intel.com> writes:

> On 6/16/25 11:25, Ankur Arora wrote:
>> I'm happy to try other tests if you have some in mind.
>
> I'd just want to make sure that the normal 4k clear_page() users aren't
> seeing anything weird.
>
> A good old kernel compile would be fine.

Makes sense. I can do that both with and without THP.

--
ankur


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing
  2025-06-16 14:44   ` Dave Hansen
  2025-06-16 14:50     ` Peter Zijlstra
  2025-06-16 14:58     ` Matthew Wilcox
@ 2025-06-16 18:47     ` Ankur Arora
  2025-06-19 23:51       ` Ankur Arora
  2 siblings, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2025-06-16 18:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk


Dave Hansen <dave.hansen@intel.com> writes:

> On 6/15/25 22:22, Ankur Arora wrote:
>> Override the common code version of folio_zero_user() so we can use
>> clear_pages() to do multi-page clearing instead of the standard
>> page-at-a-time clearing.
>
> I'm not a big fan of the naming in this series.
>
> To me multi-page means "more than one 'struct page'". But this series is
> clearly using multi-page clearing to mean clearing >PAGE_SIZE in one
> clear. But oh well.

I'd say it's doing both of those. Seen from the folio side, it is
clearing more than one struct page.

Once you descend to the clearing primitive, that's just page-aligned
memory.

> The second problem with where this ends up is that none of the code is
> *actually* x86-specific. The only thing that x86 provides that's
> interesting is a clear_pages() implementation that hands >PAGE_SIZE
> units down to the CPUs.
>
> The result is ~100 lines of code that will compile and run functionally
> on any architecture.

True. The underlying assumption is that you can provide extent-level
information to string instructions, which AFAIK only exists on x86.

> To me, that's deserving of an ARCH_HAS_FOO bit that we can set on the
> x86 side that then cajoles the core mm/ code to use the fancy new
> clear_pages_resched() implementation.

This seems straight-forward enough.

> Because what are the arm64 guys going to do when their CPUs start doing
> this? They're either going to copy-and-paste the x86 implementation or
> they're going to refactor the x86 implementation into common code.

These instructions have been around for an awfully long time. Are other
architectures looking at adding similar instructions?

I think this is definitely worth doing if there are performance
advantages on arm64 -- maybe just because of the reduced per-page
overhead.

Let me try this out on arm64.

> My money is on the refactoring, because those arm64 guys do good work.
> Could we save them the trouble, please?

> Oh, and one other little thing:
>
>> +/*
>> + * Limit the optimized version of folio_zero_user() to !CONFIG_HIGHMEM.
>> + * We do that because clear_pages() works on contiguous kernel pages
>> + * which might not be true under HIGHMEM.
>> + */
>
> The tip trees are picky about imperative voice, so no "we's". But if you
> stick this in mm/, folks are less picky. ;)

Hah. That might come in handy ;).

--
ankur


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing
  2025-06-16 18:47     ` Ankur Arora
@ 2025-06-19 23:51       ` Ankur Arora
  0 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-06-19 23:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, x86, akpm, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, acme, namhyung, tglx, willy, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk


Ankur Arora <ankur.a.arora@oracle.com> writes:

> Dave Hansen <dave.hansen@intel.com> writes:
>
>> On 6/15/25 22:22, Ankur Arora wrote:

[ ... ]

>> The second problem with where this ends up is that none of the code is
>> *actually* x86-specific. The only thing that x86 provides that's
>> interesting is a clear_pages() implementation that hands >PAGE_SIZE
>> units down to the CPUs.
>>
>> The result is ~100 lines of code that will compile and run functionally
>> on any architecture.
>
> True. The underlying assumption is that you can provide extent-level
> information to string instructions, which AFAIK only exists on x86.
>
>> To me, that's deserving of an ARCH_HAS_FOO bit that we can set on the
>> x86 side that then cajoles the core mm/ code to use the fancy new
>> clear_pages_resched() implementation.
>
> This seems straight-forward enough.
>
>> Because what are the arm64 guys going to do when their CPUs start doing
>> this? They're either going to copy-and-paste the x86 implementation or
>> they're going to refactor the x86 implementation into common code.
>
> These instructions have been around for an awfully long time. Are other
> architectures looking at adding similar instructions?

Just to answer my own question: arm64 with FEAT_MOPS (post v8.8) does
support operating on memory extents. (Both clearing and copying.)
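
For reference, a FEAT_MOPS memory set is the SETP/SETM/SETE
(prologue/main/epilogue) sequence, roughly like this (a sketch per the
Arm ARM; untested here):

	mov	x2, xzr			// fill value = 0
	setp	[x0]!, x1!, x2		// prologue
	setm	[x0]!, x1!, x2		// main
	sete	[x0]!, x1!, x2		// epilogue

with x0 the destination and x1 the byte count, so the CPU sees the full
extent much like REP; STOS does.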

> I think this is definitely worth doing if there are performance
> advantages on arm64 -- maybe just because of the reduced per-page
> overhead.
>
> Let me try this out on arm64.
>
>> My money is on the refactoring, because those arm64 guys do good work.
>> Could we save them the trouble, please?

I thought about this and this definitely makes sense to do. But, it
really suggests a larger set of refactors:

1. hugepage clearing via clear_pages() (this series)
2. hugepage copying via copy_pages()

Both of these are faster than the current per-page approach on x86. And,
from some preliminary tests, at least no slower on arm64.
(My arm64 test machine does not have FEAT_MOPS.)

With those two done, we should be able to simplify the current
folio_zero_user(), copy_user_large_folio(), and process_huge_page(),
which are overcomplicated. Other archs that care about performance
could switch to the multi-page approach.

3. Simplify the logic around process_huge_page().

None of these pieces are overly complex. I think the only question is
how to stage it.

Ideally I would like to stage them sequentially and not send out a
single unwieldy series that touches mm and has performance implications
for multiple architectures.

Also would be good to get wider testing for each part.

What do you think? I guess this is also a question for Andrew.

--
ankur


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/13] x86/mm: Add multi-page clearing
  2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
                   ` (13 preceding siblings ...)
  2025-06-16 15:06 ` [PATCH v4 00/13] x86/mm: " Dave Hansen
@ 2025-07-04  8:15 ` Raghavendra K T
  2025-07-07 21:02   ` Ankur Arora
  14 siblings, 1 reply; 32+ messages in thread
From: Raghavendra K T @ 2025-07-04  8:15 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, acme,
	namhyung, tglx, willy, jon.grimm, bharata, boris.ostrovsky,
	konrad.wilk


On 6/16/2025 10:52 AM, Ankur Arora wrote:
> This series adds multi-page clearing for hugepages, improving on the
> current page-at-a-time approach in two ways:
> 
>   - amortizes the per-page setup cost over a larger extent
>   - when using string instructions, exposes the real region size to the
>     processor. A processor could use that as a hint to optimize based
>     on the full extent size. AMD Zen uarchs, as an example, elide
>     allocation of cachelines for regions larger than L3-size.
> 
> Demand faulting a 64GB region shows good performance improvements:
> 
>   $ perf bench mem map -p $page-size -f demand -s 64GB -l 5
> 
>                   mm/folio_zero_user    x86/folio_zero_user       change
>                    (GB/s  +- %stdev)     (GB/s  +- %stdev)
> 
>    pg-sz=2MB       11.82  +- 0.67%        16.48  +-  0.30%       + 39.4%
>    pg-sz=1GB       17.51  +- 1.19%        40.03  +-  7.26% [#]   +129.9%
> 
> [#] Only with preempt=full|lazy because cooperatively preempted models
> need regular invocations of cond_resched(). This limits the extent
> sizes that can be cleared as a unit.
> 
> Raghavendra also tested on AMD Genoa and that shows similar
> improvements [1].
> 
[...]
Sorry for coming back late on this:
It was nice to have it integrated into perf bench mem (easy to test :)).

I do see a similar (almost the same) improvement again with the rebased
kernel and patchset.
Tested only with preempt=lazy and boost=1.

base       6.16-rc4 + 1-9 patches of this series
patched =  6.16-rc4 + all patches

SUT: Genoa+ AMD EPYC 9B24

  $ perf bench mem map -p $page-size -f populate -s 64GB -l 10
                    base               patched              change
   pg-sz=2MB       12.731939 GB/sec    26.304263 GB/sec     106.6%
   pg-sz=1GB       26.232423 GB/sec    61.174836 GB/sec     133.2%

For 4KB page size there is a slight improvement (mostly noise).

Thanks and Regards
- Raghu



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/13] x86/mm: Add multi-page clearing
  2025-07-04  8:15 ` Raghavendra K T
@ 2025-07-07 21:02   ` Ankur Arora
  0 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-07-07 21:02 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	jon.grimm, bharata, boris.ostrovsky, konrad.wilk


Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 6/16/2025 10:52 AM, Ankur Arora wrote:
>> This series adds multi-page clearing for hugepages, improving on the
>> current page-at-a-time approach in two ways:
>>   - amortizes the per-page setup cost over a larger extent
>>   - when using string instructions, exposes the real region size to the
>>     processor. A processor could use that as a hint to optimize based
>>     on the full extent size. AMD Zen uarchs, as an example, elide
>>     allocation of cachelines for regions larger than L3-size.
>> Demand faulting a 64GB region shows good performance improvements:
>>   $ perf bench mem map -p $page-size -f demand -s 64GB -l 5
>>                   mm/folio_zero_user    x86/folio_zero_user       change
>>                    (GB/s  +- %stdev)     (GB/s  +- %stdev)
>>    pg-sz=2MB       11.82  +- 0.67%        16.48  +-  0.30%       + 39.4%
>>    pg-sz=1GB       17.51  +- 1.19%        40.03  +-  7.26% [#]   +129.9%
>> [#] Only with preempt=full|lazy because cooperatively preempted models
>> need regular invocations of cond_resched(). This limits the extent
>> sizes that can be cleared as a unit.
>> Raghavendra also tested on AMD Genoa and that shows similar
>> improvements [1].
>>
> [...]
> Sorry for coming back late on this:
> It was nice to have it integrated to perf bench mem (easy to test :)).
>
> I do see similar (almost same) improvement again with the rebased kernel
> and patchset.
> Tested only preempt=lazy and boost=1
>
> base       6.16-rc4 + 1-9 patches of this series
> patched =  6.16-rc4 + all patches
>
> SUT: Genoa+ AMD EPYC 9B24
>
>  $ perf bench mem map -p $page-size -f populate -s 64GB -l 10
>                    base               patched              change
>   pg-sz=2MB       12.731939 GB/sec    26.304263 GB/sec     106.6%
>   pg-sz=1GB       26.232423 GB/sec    61.174836 GB/sec     133.2%

Thanks for trying them out. Looks great.

--
ankur


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2025-07-07 21:05 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2025-06-16  5:22 [PATCH v4 00/13] x86/mm: Add multi-page clearing Ankur Arora
2025-06-16  5:22 ` [PATCH v4 01/13] perf bench mem: Remove repetition around time measurement Ankur Arora
2025-06-16  5:22 ` [PATCH v4 02/13] perf bench mem: Defer type munging of size to float Ankur Arora
2025-06-16  5:22 ` [PATCH v4 03/13] perf bench mem: Move mem op parameters into a structure Ankur Arora
2025-06-16  5:22 ` [PATCH v4 04/13] perf bench mem: Pull out init/fini logic Ankur Arora
2025-06-16  5:22 ` [PATCH v4 05/13] perf bench mem: Switch from zalloc() to mmap() Ankur Arora
2025-06-16  5:22 ` [PATCH v4 06/13] perf bench mem: Allow mapping of hugepages Ankur Arora
2025-06-16  5:22 ` [PATCH v4 07/13] perf bench mem: Allow chunking on a memory region Ankur Arora
2025-06-16  5:22 ` [PATCH v4 08/13] perf bench mem: Refactor mem_options Ankur Arora
2025-06-16  5:22 ` [PATCH v4 09/13] perf bench mem: Add mmap() workloads Ankur Arora
2025-06-16  5:22 ` [PATCH v4 10/13] x86/mm: Simplify clear_page_* Ankur Arora
2025-06-16 14:35   ` Dave Hansen
2025-06-16 14:38     ` Peter Zijlstra
2025-06-16 18:18     ` Ankur Arora
2025-06-16 16:48   ` kernel test robot
2025-06-16  5:22 ` [PATCH v4 11/13] x86/clear_page: Introduce clear_pages() Ankur Arora
2025-06-16  5:22 ` [PATCH v4 12/13] mm: memory: allow arch override for folio_zero_user() Ankur Arora
2025-06-16  5:22 ` [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing Ankur Arora
2025-06-16 11:39   ` kernel test robot
2025-06-16 14:44   ` Dave Hansen
2025-06-16 14:50     ` Peter Zijlstra
2025-06-16 15:03       ` Dave Hansen
2025-06-16 18:20       ` Ankur Arora
2025-06-16 14:58     ` Matthew Wilcox
2025-06-16 18:47     ` Ankur Arora
2025-06-19 23:51       ` Ankur Arora
2025-06-16 15:06 ` [PATCH v4 00/13] x86/mm: " Dave Hansen
2025-06-16 18:25   ` Ankur Arora
2025-06-16 18:30     ` Dave Hansen
2025-06-16 18:43       ` Ankur Arora
2025-07-04  8:15 ` Raghavendra K T
2025-07-07 21:02   ` Ankur Arora

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).