public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed
* [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes
@ 2026-03-20 20:42 Waiman Long
  2026-03-20 20:42 ` [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2) Waiman Long
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Waiman Long @ 2026-03-20 20:42 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport
  Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
	Sean Christopherson, James Houghton, Sebastian Chlad,
	Guopeng Zhang, Li Wang, Waiman Long

 v2:
  - Change vmstats flush threshold scaling from ilog2() to int_sqrt() as
    suggested by Li Wang
  - Fix a number of issues reported by AI review of the patch series

There are a number of test failures when running the test_memcontrol
selftest on a 128-core arm64 system on kernels with 4k/16k/64k page
sizes. This patch series makes some minor changes to the kernel and
the test_memcontrol selftest to address these failures.

The first kernel patch scales the memcg vmstats flush threshold
with int_sqrt() instead of linearly with the total number of CPUs. The
second kernel patch scales down MEMCG_CHARGE_BATCH with increases in
page size. These two patches help to reduce the discrepancies between
the reported usage data and the actual values.

The next 5 test_memcontrol selftest patches adjust the testing code to
greatly reduce the chance that it will report failure, though some
occasional failures are still possible.

To verify the changes, the test_memcontrol selftest was run 100
times each on a 128-core arm64 system on kernels with 4k/16k/64k
page sizes.  No failure was observed other than some failures of the
test_memcg_reclaim test when running on a 16k page size kernel. The
reclaim_until() call failed because of an unexpected over-reclaim of
memory. This will need further investigation, but it happens with the
16k page size kernel only and I don't have a production-ready kernel
config file to use in building this 16k page size kernel. The new
test_memcontrol selftest and kernel were also run on a 96-core x86
system to make sure there was no regression.

Waiman Long (7):
  memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2)
  memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE
  selftests: memcg: Iterate pages based on the actual page size
  selftests: memcg: Increase error tolerance in accordance with page
    size
  selftests: memcg: Reduce the expected swap.peak with larger page size
  selftests: memcg: Don't call reclaim_until() if already in target
  selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as
    XFAIL

 include/linux/memcontrol.h                    |  8 +-
 mm/memcontrol.c                               | 18 ++--
 .../cgroup/lib/include/cgroup_util.h          |  3 +-
 .../selftests/cgroup/test_memcontrol.c        | 87 +++++++++++++++----
 4 files changed, 93 insertions(+), 23 deletions(-)

-- 
2.53.0



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2)
  2026-03-20 20:42 [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
@ 2026-03-20 20:42 ` Waiman Long
  2026-03-23 12:46   ` Li Wang
  2026-03-20 20:42 ` [PATCH v2 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE Waiman Long
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2026-03-20 20:42 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport
  Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
	Sean Christopherson, James Houghton, Sebastian Chlad,
	Guopeng Zhang, Li Wang, Waiman Long, Li Wang

The vmstats flush threshold currently increases linearly with the
number of online CPUs. As the number of CPUs increases over time, it
will become increasingly difficult to meet the threshold and update the
vmstats data in a timely manner. These days, systems with hundreds of
CPUs or even thousands of them are becoming more common.

For example, the test_memcg_sock test of test_memcontrol always fails
when running on an arm64 system with 128 CPUs. This is because the
threshold is now 64*128 = 8192 update events. With a 4k page size, that
corresponds to changes in 32 MB of memory. It will be even worse with a
larger page size like 64k.

To make the output of memory.stat more accurate, it is better to scale
up the threshold more slowly than linearly with the number of CPUs. The
int_sqrt() function is a good compromise, as suggested by Li Wang [1].
An extra 2 is added to make sure that the threshold is doubled for a
2-core system. The increase will be slower after that.

With the int_sqrt() scaling, we can use the possibly larger
num_possible_cpus() instead of num_online_cpus(), which may change at
run time.

Although there is supposed to be a periodic and asynchronous flush of
vmstats every 2 seconds, the actual time lag between successive runs
can vary quite a bit. In fact, I have seen time lags of up to tens of
seconds in some cases. So we cannot rely too much on the hope that
there will be an asynchronous vmstats flush every 2 seconds. This may
be something we need to look into.

[1] https://lore.kernel.org/lkml/ab0kAE7mJkEL9kWb@redhat.com/

Suggested-by: Li Wang <liwang@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/memcontrol.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 772bac21d155..cc1fc0f5aeea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -548,20 +548,20 @@ struct memcg_vmstats {
  *    rstat update tree grow unbounded.
  *
  * 2) Flush the stats synchronously on reader side only when there are more than
- *    (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
- *    will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
- *    only for 2 seconds due to (1).
+ *    (MEMCG_CHARGE_BATCH * int_sqrt(nr_cpus+2)) update events. Though this
+ *    optimization will let stats be out of sync by up to that amount, this
+ *    is supposed to last for at most 2 seconds due to (1).
  */
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
 static u64 flush_last_time;
+static int vmstats_flush_threshold __ro_after_init;
 
 #define FLUSH_TIME (2UL*HZ)
 
 static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
 {
-	return atomic_read(&vmstats->stats_updates) >
-		MEMCG_CHARGE_BATCH * num_online_cpus();
+	return atomic_read(&vmstats->stats_updates) > vmstats_flush_threshold;
 }
 
 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
@@ -5191,6 +5191,14 @@ int __init mem_cgroup_init(void)
 
 	memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
 				     SLAB_PANIC | SLAB_HWCACHE_ALIGN);
+	/*
+	 * Scale up vmstats flush threshold with int_sqrt(nr_cpus+2). The extra
+	 * 2 constant is to make sure that the threshold is doubled for a 2-core
+	 * system. After that, it will increase by MEMCG_CHARGE_BATCH when the
+	 * number of CPUs reaches the next (n^2 - 2) value.
+	 */
+	vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
+				  (int_sqrt(num_possible_cpus() + 2));
 
 	return 0;
 }
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE
  2026-03-20 20:42 [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
  2026-03-20 20:42 ` [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2) Waiman Long
@ 2026-03-20 20:42 ` Waiman Long
  2026-03-23 12:47   ` Li Wang
  2026-03-20 20:42 ` [PATCH v2 3/7] selftests: memcg: Iterate pages based on the actual page size Waiman Long
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2026-03-20 20:42 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport
  Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
	Sean Christopherson, James Houghton, Sebastian Chlad,
	Guopeng Zhang, Li Wang, Waiman Long

For a system with 4k page size, each percpu memcg_stock can hide up
to 256 kbytes of memory with the current MEMCG_CHARGE_BATCH value of
64. For a system with 64k page size, that becomes 4 Mbytes. These
hidden charges will affect the accuracy of the memory.current value.

This MEMCG_CHARGE_BATCH value also controls how often the memcg
vmstat values need to be flushed. As a result, the values reported
in memory.stat cgroup control files are less indicative of the actual
memory consumption of a particular memory cgroup when the page size
increases beyond 4k.

This problem can be illustrated by running the test_memcontrol
selftest. Running a 4k page size kernel on a 128-core arm64 system,
the test_memcg_current_peak test, which allocates 50M of anonymous
memory, passed. With a 64k page size kernel on the same system,
however, the same test failed because the "anon" attribute of the
memory.stat file might report a size of 0 depending on the number of
CPUs the system has.

To solve this inaccurate memory stats problem, we need to scale down
the amount of memory that can be hidden by reducing MEMCG_CHARGE_BATCH
when the page size increases. The same user application will likely
consume more memory on systems with a larger page size, and charging
also becomes less efficient if we scale down MEMCG_CHARGE_BATCH by too
much. So I believe a good compromise is to scale down
MEMCG_CHARGE_BATCH by a factor of 2 for 16k page size and by a factor
of 4 for 64k page size.

With that change, the test_memcg_current_peak test passed again with
the modified 64k page size kernel.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/memcontrol.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 70b685a85bf4..748cfd75d998 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -328,8 +328,14 @@ struct mem_cgroup {
  * size of first charge trial.
  * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
  * workload.
+ *
+ * There are 3 common base page sizes - 4k, 16k & 64k. In order to limit the
+ * amount of memory that can be hidden in each percpu memcg_stock for a given
+ * memcg, we scale down MEMCG_CHARGE_BATCH by 2 for 16k and 4 for 64k.
  */
-#define MEMCG_CHARGE_BATCH 64U
+#define MEMCG_CHARGE_BATCH_BASE  64U
+#define MEMCG_CHARGE_BATCH_SHIFT ((PAGE_SHIFT <= 16) ? (PAGE_SHIFT - 12)/2 : 2)
+#define MEMCG_CHARGE_BATCH	 (MEMCG_CHARGE_BATCH_BASE >> MEMCG_CHARGE_BATCH_SHIFT)
 
 extern struct mem_cgroup *root_mem_cgroup;
 
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 3/7] selftests: memcg: Iterate pages based on the actual page size
  2026-03-20 20:42 [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
  2026-03-20 20:42 ` [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2) Waiman Long
  2026-03-20 20:42 ` [PATCH v2 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE Waiman Long
@ 2026-03-20 20:42 ` Waiman Long
  2026-03-23  2:53   ` Li Wang
  2026-03-20 20:42 ` [PATCH v2 4/7] selftests: memcg: Increase error tolerance in accordance with " Waiman Long
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2026-03-20 20:42 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport
  Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
	Sean Christopherson, James Houghton, Sebastian Chlad,
	Guopeng Zhang, Li Wang, Waiman Long, Li Wang

The current test_memcontrol test faults in memory by writing a value
to the start of each page, based on the default value of 4k page size.
Micro-optimize it by using the actual system page size to do the
iteration.

Reviewed-by: Li Wang <liwang@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 tools/testing/selftests/cgroup/test_memcontrol.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index a25eb097b31c..babbfad10aaf 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -25,6 +25,7 @@
 
 static bool has_localevents;
 static bool has_recursiveprot;
+static int page_size;
 
 int get_temp_fd(void)
 {
@@ -60,7 +61,7 @@ int alloc_anon(const char *cgroup, void *arg)
 	char *buf, *ptr;
 
 	buf = malloc(size);
-	for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
+	for (ptr = buf; ptr < buf + size; ptr += page_size)
 		*ptr = 0;
 
 	free(buf);
@@ -183,7 +184,7 @@ static int alloc_anon_50M_check(const char *cgroup, void *arg)
 		return -1;
 	}
 
-	for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
+	for (ptr = buf; ptr < buf + size; ptr += page_size)
 		*ptr = 0;
 
 	current = cg_read_long(cgroup, "memory.current");
@@ -413,7 +414,7 @@ static int alloc_anon_noexit(const char *cgroup, void *arg)
 		return -1;
 	}
 
-	for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
+	for (ptr = buf; ptr < buf + size; ptr += page_size)
 		*ptr = 0;
 
 	while (getppid() == ppid)
@@ -999,7 +1000,7 @@ static int alloc_anon_50M_check_swap(const char *cgroup, void *arg)
 		return -1;
 	}
 
-	for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
+	for (ptr = buf; ptr < buf + size; ptr += page_size)
 		*ptr = 0;
 
 	mem_current = cg_read_long(cgroup, "memory.current");
@@ -1679,6 +1680,10 @@ int main(int argc, char **argv)
 	char root[PATH_MAX];
 	int i, proc_status;
 
+	page_size = sysconf(_SC_PAGE_SIZE);
+	if (page_size <= 0)
+		page_size = PAGE_SIZE;
+
 	ksft_print_header();
 	ksft_set_plan(ARRAY_SIZE(tests));
 	if (cg_find_unified_root(root, sizeof(root), NULL))
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 4/7] selftests: memcg: Increase error tolerance in accordance with page size
  2026-03-20 20:42 [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
                   ` (2 preceding siblings ...)
  2026-03-20 20:42 ` [PATCH v2 3/7] selftests: memcg: Iterate pages based on the actual page size Waiman Long
@ 2026-03-20 20:42 ` Waiman Long
  2026-03-23  8:01   ` Li Wang
  2026-03-20 20:42 ` [PATCH v2 5/7] selftests: memcg: Reduce the expected swap.peak with larger " Waiman Long
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2026-03-20 20:42 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport
  Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
	Sean Christopherson, James Houghton, Sebastian Chlad,
	Guopeng Zhang, Li Wang, Waiman Long

It was found that some of the tests in test_memcontrol can fail more
readily if the system page size is larger than 4k. This is because the
actual memory.current value deviates more from the expected value with
a larger page size, likely because there may be up to
MEMCG_CHARGE_BATCH pages of charge hidden in each one of the percpu
memcg_stock structures.

To avoid this failure, the error tolerance is now increased in
accordance with the current system page size value. The page size
scale factor is set to 2 for 64k pages and 1 for 16k pages.

Changes are made in alloc_pagecache_max_30M(), test_memcg_protection()
and alloc_anon_50M_check_swap() to increase the error tolerance for
memory.current with larger page sizes. The current set of values is
chosen to ensure that the relevant test_memcontrol tests no longer
have any test failure in 100 repeated runs of test_memcontrol with
4k/16k/64k page size kernels on an arm64 system.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 .../cgroup/lib/include/cgroup_util.h          |  3 ++-
 .../selftests/cgroup/test_memcontrol.c        | 23 ++++++++++++++-----
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
index 77f386dab5e8..2293e770e9b4 100644
--- a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
@@ -6,7 +6,8 @@
 #define PAGE_SIZE 4096
 #endif
 
-#define MB(x) (x << 20)
+#define KB(x) ((x) << 10)
+#define MB(x) ((x) << 20)
 
 #define USEC_PER_SEC	1000000L
 #define NSEC_PER_SEC	1000000000L
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index babbfad10aaf..c078fc458def 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -26,6 +26,7 @@
 static bool has_localevents;
 static bool has_recursiveprot;
 static int page_size;
+static int pscale_factor;	/* Page size scale factor */
 
 int get_temp_fd(void)
 {
@@ -571,16 +572,17 @@ static int test_memcg_protection(const char *root, bool min)
 	if (cg_run(parent[2], alloc_anon, (void *)MB(148)))
 		goto cleanup;
 
-	if (!values_close(cg_read_long(parent[1], "memory.current"), MB(50), 3))
+	if (!values_close(cg_read_long(parent[1], "memory.current"), MB(50),
+				       3 + (min ? 0 : 4) * pscale_factor))
 		goto cleanup;
 
 	for (i = 0; i < ARRAY_SIZE(children); i++)
 		c[i] = cg_read_long(children[i], "memory.current");
 
-	if (!values_close(c[0], MB(29), 15))
+	if (!values_close(c[0], MB(29), 15 + 3 * pscale_factor))
 		goto cleanup;
 
-	if (!values_close(c[1], MB(21), 20))
+	if (!values_close(c[1], MB(21), 20 + pscale_factor))
 		goto cleanup;
 
 	if (c[3] != 0)
@@ -596,7 +598,8 @@ static int test_memcg_protection(const char *root, bool min)
 	}
 
 	current = min ? MB(50) : MB(30);
-	if (!values_close(cg_read_long(parent[1], "memory.current"), current, 3))
+	if (!values_close(cg_read_long(parent[1], "memory.current"), current,
+				       9 + (min ? 0 : 6) * pscale_factor))
 		goto cleanup;
 
 	if (!reclaim_until(children[0], MB(10)))
@@ -684,7 +687,7 @@ static int alloc_pagecache_max_30M(const char *cgroup, void *arg)
 		goto cleanup;
 
 	current = cg_read_long(cgroup, "memory.current");
-	if (!values_close(current, MB(30), 5))
+	if (!values_close(current, MB(30), 5 + (pscale_factor ? 2 : 0)))
 		goto cleanup;
 
 	ret = 0;
@@ -1004,7 +1007,7 @@ static int alloc_anon_50M_check_swap(const char *cgroup, void *arg)
 		*ptr = 0;
 
 	mem_current = cg_read_long(cgroup, "memory.current");
-	if (!mem_current || !values_close(mem_current, mem_max, 3))
+	if (!mem_current || !values_close(mem_current, mem_max, 6 + pscale_factor))
 		goto cleanup;
 
 	swap_current = cg_read_long(cgroup, "memory.swap.current");
@@ -1684,6 +1687,14 @@ int main(int argc, char **argv)
 	if (page_size <= 0)
 		page_size = PAGE_SIZE;
 
+	/*
+	 * It is found that the actual memory.current value can deviate more
+	 * from the expected value with larger page size. So error tolerance
+	 * will have to be increased a bit more for larger page size.
+	 */
+	if (page_size > KB(4))
+		pscale_factor = (page_size >= KB(64)) ? 2 : 1;
+
 	ksft_print_header();
 	ksft_set_plan(ARRAY_SIZE(tests));
 	if (cg_find_unified_root(root, sizeof(root), NULL))
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 5/7] selftests: memcg: Reduce the expected swap.peak with larger page size
  2026-03-20 20:42 [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
                   ` (3 preceding siblings ...)
  2026-03-20 20:42 ` [PATCH v2 4/7] selftests: memcg: Increase error tolerance in accordance with " Waiman Long
@ 2026-03-20 20:42 ` Waiman Long
  2026-03-23  8:24   ` Li Wang
  2026-03-20 20:42 ` [PATCH v2 6/7] selftests: memcg: Don't call reclaim_until() if already in target Waiman Long
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2026-03-20 20:42 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport
  Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
	Sean Christopherson, James Houghton, Sebastian Chlad,
	Guopeng Zhang, Li Wang, Waiman Long

When running the test_memcg_swap_max_peak test, which sets swap.max
to 30M, on an arm64 system with 64k page size, the test failed as the
swap.peak could only reach up to 27,328,512 bytes (about 25.45 MB,
which is lower than the expected 29M) before the allocating task got
oom-killed.

This is likely because it takes longer to write out a larger page to
swap, and hence a lower swap.peak is reached. Setting memory.high to
29M to throttle memory allocation when nearing memory.max helps, but
swap.peak could still only reach up to 29,032,448 bytes (about
27.04M). As a result, we have to reduce the expected swap.peak with a
larger page size. Now swap.peak is expected to reach 29M with 4k
pages, 28M with 16k pages and only 27M with 64k pages.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 .../selftests/cgroup/test_memcontrol.c        | 26 ++++++++++++++++---
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index c078fc458def..3832ded1e47b 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1032,6 +1032,7 @@ static int test_memcg_swap_max_peak(const char *root)
 	char *memcg;
 	long max, peak;
 	struct stat ss;
+	long swap_peak;
 	int swap_peak_fd = -1, mem_peak_fd = -1;
 
 	/* any non-empty string resets */
@@ -1119,6 +1120,23 @@ static int test_memcg_swap_max_peak(const char *root)
 	if (cg_write(memcg, "memory.max", "30M"))
 		goto cleanup;
 
+	/*
+	 * The swap.peak that can be reached will depend on the system page
+	 * size. With a larger page size (e.g. 64k), it takes more time to write
+	 * an anonymous memory page to swap and so the peak reached will be
+	 * lower before the memory allocation process gets oom-killed. One way
+	 * to allow the swap.peak to go higher is to throttle memory allocation
+	 * by setting memory.high to, say, 29M to give more time to swap out the
+	 * memory before oom-kill. This is still not enough to reach the 29M
+	 * reachable with a 4k page. So we still need to reduce the expected
+	 * swap.peak accordingly.
+	 */
+	swap_peak = (page_size == KB(4)) ? MB(29) :
+		   ((page_size <= KB(16)) ? MB(28) : MB(27));
+
+	if (cg_write(memcg, "memory.high", "29M"))
+		goto cleanup;
+
 	/* Should be killed by OOM killer */
 	if (!cg_run(memcg, alloc_anon, (void *)MB(100)))
 		goto cleanup;
@@ -1134,7 +1152,7 @@ static int test_memcg_swap_max_peak(const char *root)
 		goto cleanup;
 
 	peak = cg_read_long(memcg, "memory.swap.peak");
-	if (peak < MB(29))
+	if (peak < swap_peak)
 		goto cleanup;
 
 	peak = cg_read_long_fd(mem_peak_fd);
@@ -1142,7 +1160,7 @@ static int test_memcg_swap_max_peak(const char *root)
 		goto cleanup;
 
 	peak = cg_read_long_fd(swap_peak_fd);
-	if (peak < MB(29))
+	if (peak < swap_peak)
 		goto cleanup;
 
 	/*
@@ -1181,7 +1199,7 @@ static int test_memcg_swap_max_peak(const char *root)
 	if (cg_read_long(memcg, "memory.peak") < MB(29))
 		goto cleanup;
 
-	if (cg_read_long(memcg, "memory.swap.peak") < MB(29))
+	if (cg_read_long(memcg, "memory.swap.peak") < swap_peak)
 		goto cleanup;
 
 	if (cg_run(memcg, alloc_anon_50M_check_swap, (void *)MB(30)))
@@ -1196,7 +1214,7 @@ static int test_memcg_swap_max_peak(const char *root)
 		goto cleanup;
 
 	peak = cg_read_long(memcg, "memory.swap.peak");
-	if (peak < MB(29))
+	if (peak < swap_peak)
 		goto cleanup;
 
 	peak = cg_read_long_fd(mem_peak_fd);
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 6/7] selftests: memcg: Don't call reclaim_until() if already in target
  2026-03-20 20:42 [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
                   ` (4 preceding siblings ...)
  2026-03-20 20:42 ` [PATCH v2 5/7] selftests: memcg: Reduce the expected swap.peak with larger " Waiman Long
@ 2026-03-20 20:42 ` Waiman Long
  2026-03-23  8:53   ` Li Wang
  2026-03-20 20:42 ` [PATCH v2 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL Waiman Long
  2026-03-21  1:16 ` [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Andrew Morton
  7 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2026-03-20 20:42 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport
  Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
	Sean Christopherson, James Houghton, Sebastian Chlad,
	Guopeng Zhang, Li Wang, Waiman Long

Near the end of test_memcg_protection(), reclaim_until() is called
to reduce memory.current of children[0] to 10M. It was found that
with a larger page size (e.g. 64k), the various memory cgroups in
test_memcg_protection() would deviate further from the expected
values, especially for the test_memcg_low test. As a result,
children[0] might have already reached the target without any
reclamation. This will cause the reclaim_until() function to report
failure as no reclamation is needed.

Avoid this unexpected failure by skipping the reclaim_until() call if
memory.current of children[0] has already reached the target size for
kernels with a non-4k page size.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 tools/testing/selftests/cgroup/test_memcontrol.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 3832ded1e47b..5336be5ed2f5 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -490,6 +490,7 @@ static int test_memcg_protection(const char *root, bool min)
 	long current;
 	int i, attempts;
 	int fd;
+	bool do_reclaim;
 
 	fd = get_temp_fd();
 	if (fd < 0)
@@ -602,7 +603,15 @@ static int test_memcg_protection(const char *root, bool min)
 				       9 + (min ? 0 : 6) * pscale_factor))
 		goto cleanup;
 
-	if (!reclaim_until(children[0], MB(10)))
+	/*
+	 * With larger page size, it is possible that memory.current of
+	 * children[0] is close to 10M. Skip the reclaim_until() call if
+	 * that is the case.
+	 */
+	current = cg_read_long(children[0], "memory.current");
+	do_reclaim = (page_size == KB(4)) ||
+		     ((current > MB(10)) && !values_close(current, MB(10), 3));
+	if (do_reclaim && !reclaim_until(children[0], MB(10)))
 		goto cleanup;
 
 	if (min) {
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL
  2026-03-20 20:42 [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
                   ` (5 preceding siblings ...)
  2026-03-20 20:42 ` [PATCH v2 6/7] selftests: memcg: Don't call reclaim_until() if already in target Waiman Long
@ 2026-03-20 20:42 ` Waiman Long
  2026-03-23  9:44   ` Li Wang
  2026-03-21  1:16 ` [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Andrew Morton
  7 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2026-03-20 20:42 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport
  Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
	Sean Christopherson, James Houghton, Sebastian Chlad,
	Guopeng Zhang, Li Wang, Waiman Long

Although there is supposed to be a periodic and asynchronous flush of
stats every 2 seconds, the actual time lag between successive runs can
vary quite a bit. In fact, I have seen time lags of up to tens of
seconds in some cases.

At the end of test_memcg_sock, the test waits up to 3 seconds for the
"sock" attribute of memory.stat to go back down to 0. Obviously it
may occasionally fail, especially when the kernel has a large page
size (e.g. 64k). Treat this failure as an expected failure (XFAIL) to
distinguish it from the other failure cases.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 tools/testing/selftests/cgroup/test_memcontrol.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 5336be5ed2f5..af3e8fe4e50e 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1486,12 +1486,21 @@ static int test_memcg_sock(const char *root)
 	 * Poll memory.stat for up to 3 seconds (~FLUSH_TIME plus some
 	 * scheduling slack) and require that the "sock " counter
 	 * eventually drops to zero.
+	 *
+	 * The actual elapsed time between consecutive runs of the
+	 * asynchronous memcg rstat flush may vary quite a bit. So
+	 * the 3-second wait time may not be enough for the "sock"
+	 * counter to go down to 0. Treat it as an XFAIL instead of
+	 * a FAIL.
 	 */
 	sock_post = cg_read_key_long_poll(memcg, "memory.stat", "sock ", 0,
 					 MEMCG_SOCKSTAT_WAIT_RETRIES,
 					 DEFAULT_WAIT_INTERVAL_US);
-	if (sock_post)
+	if (sock_post) {
+		if (sock_post > 0)
+			ret = KSFT_XFAIL;
 		goto cleanup;
+	}
 
 	ret = KSFT_PASS;
 
@@ -1756,6 +1765,9 @@ int main(int argc, char **argv)
 		case KSFT_SKIP:
 			ksft_test_result_skip("%s\n", tests[i].name);
 			break;
+		case KSFT_XFAIL:
+			ksft_test_result_xfail("%s\n", tests[i].name);
+			break;
 		default:
 			ksft_test_result_fail("%s\n", tests[i].name);
 			break;
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes
  2026-03-20 20:42 [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
                   ` (6 preceding siblings ...)
  2026-03-20 20:42 ` [PATCH v2 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL Waiman Long
@ 2026-03-21  1:16 ` Andrew Morton
  7 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2026-03-21  1:16 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Tejun Heo, Michal Koutný, Shuah Khan,
	Mike Rapoport, linux-kernel, cgroups, linux-mm, linux-kselftest,
	Sean Christopherson, James Houghton, Sebastian Chlad,
	Guopeng Zhang, Li Wang

On Fri, 20 Mar 2026 16:42:34 -0400 Waiman Long <longman@redhat.com> wrote:

> There are a number of test failures with the running of the
> test_memcontrol selftest on a 128-core arm64 system on kernels with
> 4k/16k/64k page sizes. This patch series makes some minor changes to
> the kernel and the test_memcontrol selftest to address these failures.
> 
> The first kernel patch scales the memcg vmstats flush threshold
> with int_sqrt() instead of linearly with the total number of CPUs. The
> second kernel patch scales down MEMCG_CHARGE_BATCH with increases in page
> size. These 2 patches help to reduce the discrepancies between the
> reported usage data and the real values.
> 
> The next 5 test_memcontrol selftest patches adjust the testing code to
> greatly reduce the chance that it will report failure, though some
> occasional failures are still possible.

The AI review is up: https://sashiko.dev/#/patchset/20260320204241.1613861-1-longman@redhat.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/7] selftests: memcg: Iterate pages based on the actual page size
  2026-03-20 20:42 ` [PATCH v2 3/7] selftests: memcg: Iterate pages based on the actual page size Waiman Long
@ 2026-03-23  2:53   ` Li Wang
  2026-03-23  2:56     ` Li Wang
  2026-03-25  3:33     ` Waiman Long
  0 siblings, 2 replies; 24+ messages in thread
From: Li Wang @ 2026-03-23  2:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
	linux-kselftest, Sean Christopherson, James Houghton,
	Sebastian Chlad, Guopeng Zhang


Hi Waiman,

I currently have another patch in hand that is functionally identical to
patch 3/7 here (requested by Sashiko).

May I go ahead and merge it directly into my patch series, while
retaining your authorship attribution?

Otherwise, our respective patches will conflict when merged into the
mm-tree.

See:
https://sashiko.dev/#/patchset/20260322061038.156146-1-liwang@redhat.com


On Sat, Mar 21, 2026 at 4:43 AM Waiman Long <longman@redhat.com> wrote:

> The current test_memcontrol test faults in memory by writing a value
> to the start of each page based on the default 4k page size.
> Micro-optimize it by using the actual system page size to do the
> iteration.
>
> Reviewed-by: Li Wang <liwang@redhat.com>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  tools/testing/selftests/cgroup/test_memcontrol.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c
> b/tools/testing/selftests/cgroup/test_memcontrol.c
> index a25eb097b31c..babbfad10aaf 100644
> --- a/tools/testing/selftests/cgroup/test_memcontrol.c
> +++ b/tools/testing/selftests/cgroup/test_memcontrol.c
> @@ -25,6 +25,7 @@
>
>  static bool has_localevents;
>  static bool has_recursiveprot;
> +static int page_size;
>
>  int get_temp_fd(void)
>  {
> @@ -60,7 +61,7 @@ int alloc_anon(const char *cgroup, void *arg)
>         char *buf, *ptr;
>
>         buf = malloc(size);
> -       for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
> +       for (ptr = buf; ptr < buf + size; ptr += page_size)
>                 *ptr = 0;
>
>         free(buf);
> @@ -183,7 +184,7 @@ static int alloc_anon_50M_check(const char *cgroup,
> void *arg)
>                 return -1;
>         }
>
> -       for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
> +       for (ptr = buf; ptr < buf + size; ptr += page_size)
>                 *ptr = 0;
>
>         current = cg_read_long(cgroup, "memory.current");
> @@ -413,7 +414,7 @@ static int alloc_anon_noexit(const char *cgroup, void
> *arg)
>                 return -1;
>         }
>
> -       for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
> +       for (ptr = buf; ptr < buf + size; ptr += page_size)
>                 *ptr = 0;
>
>         while (getppid() == ppid)
> @@ -999,7 +1000,7 @@ static int alloc_anon_50M_check_swap(const char
> *cgroup, void *arg)
>                 return -1;
>         }
>
> -       for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
> +       for (ptr = buf; ptr < buf + size; ptr += page_size)
>                 *ptr = 0;
>
>         mem_current = cg_read_long(cgroup, "memory.current");
> @@ -1679,6 +1680,10 @@ int main(int argc, char **argv)
>         char root[PATH_MAX];
>         int i, proc_status;
>
> +       page_size = sysconf(_SC_PAGE_SIZE);
> +       if (page_size <= 0)
> +               page_size = PAGE_SIZE;
> +
>         ksft_print_header();
>         ksft_set_plan(ARRAY_SIZE(tests));
>         if (cg_find_unified_root(root, sizeof(root), NULL))
> --
> 2.53.0
>
>

-- 
Regards,
Li Wang


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/7] selftests: memcg: Iterate pages based on the actual page size
  2026-03-23  2:53   ` Li Wang
@ 2026-03-23  2:56     ` Li Wang
  2026-03-25  3:33     ` Waiman Long
  1 sibling, 0 replies; 24+ messages in thread
From: Li Wang @ 2026-03-23  2:56 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
	linux-kselftest, Sean Christopherson, James Houghton,
	Sebastian Chlad, Guopeng Zhang


On Mon, Mar 23, 2026 at 10:53 AM Li Wang <liwang@redhat.com> wrote:

> Hi Waiman,
>
> I currently have another patch in hand that is functionally identical to
> patch 3/7 here (requested by Sashiko).
>
> May I go ahead and merge it directly into my patch series, while
> retaining your authorship attribution?
>
> Otherwise, our respective patches will conflict when merged into the
> mm-tree.
>
> See:
> https://sashiko.dev/#/patchset/20260322061038.156146-1-liwang@redhat.com
>

The overlap patch:
  https://lore.kernel.org/all/20260313043532.103987-4-liwang@redhat.com/

-- 
Regards,
Li Wang


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 4/7] selftests: memcg: Increase error tolerance in accordance with page size
  2026-03-20 20:42 ` [PATCH v2 4/7] selftests: memcg: Increase error tolerance in accordance with " Waiman Long
@ 2026-03-23  8:01   ` Li Wang
  2026-03-25 16:42     ` Waiman Long
  0 siblings, 1 reply; 24+ messages in thread
From: Li Wang @ 2026-03-23  8:01 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
	linux-kselftest, Sean Christopherson, James Houghton,
	Sebastian Chlad, Guopeng Zhang, Li Wang

On Fri, Mar 20, 2026 at 04:42:38PM -0400, Waiman Long wrote:
> It was found that some of the tests in test_memcontrol can fail more
> readily if the system page size is larger than 4k. This is because the
> actual memory.current value deviates more from the expected value with
> larger page size. This is likely due to the fact there may be up to
> MEMCG_CHARGE_BATCH pages of charge hidden in each one of the percpu
> memcg_stock.
> 
> To avoid this failure, the error tolerance is now increased in accordance
> with the current system page size value. The page size scale factor is
> set to 2 for 64k page and 1 for 16k page.
> 
> Changes are made in alloc_pagecache_max_30M(), test_memcg_protection()
> and alloc_anon_50M_check_swap() to increase the error tolerance for
> memory.current for larger page size. The current set of values are
> chosen to ensure that the relevant test_memcontrol tests no longer
> have any test failure in a 100 repeated run of test_memcontrol with a
> 4k/16k/64k page size kernels on an arm64 system.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  .../cgroup/lib/include/cgroup_util.h          |  3 ++-
>  .../selftests/cgroup/test_memcontrol.c        | 23 ++++++++++++++-----
>  2 files changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
> index 77f386dab5e8..2293e770e9b4 100644
> --- a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
> +++ b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
> @@ -6,7 +6,8 @@
>  #define PAGE_SIZE 4096
>  #endif
>  
> -#define MB(x) (x << 20)
> +#define KB(x) ((x) << 10)
> +#define MB(x) ((x) << 20)
>  
>  #define USEC_PER_SEC	1000000L
>  #define NSEC_PER_SEC	1000000000L
> diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
> index babbfad10aaf..c078fc458def 100644
> --- a/tools/testing/selftests/cgroup/test_memcontrol.c
> +++ b/tools/testing/selftests/cgroup/test_memcontrol.c
> @@ -26,6 +26,7 @@
>  static bool has_localevents;
>  static bool has_recursiveprot;
>  static int page_size;
> +static int pscale_factor;	/* Page size scale factor */
>  
>  int get_temp_fd(void)
>  {
> @@ -571,16 +572,17 @@ static int test_memcg_protection(const char *root, bool min)
>  	if (cg_run(parent[2], alloc_anon, (void *)MB(148)))
>  		goto cleanup;
>  
> -	if (!values_close(cg_read_long(parent[1], "memory.current"), MB(50), 3))
> +	if (!values_close(cg_read_long(parent[1], "memory.current"), MB(50),
> +				       3 + (min ? 0 : 4) * pscale_factor))
>  		goto cleanup;
>  
>  	for (i = 0; i < ARRAY_SIZE(children); i++)
>  		c[i] = cg_read_long(children[i], "memory.current");
>  
> -	if (!values_close(c[0], MB(29), 15))
> +	if (!values_close(c[0], MB(29), 15 + 3 * pscale_factor))
>  		goto cleanup;
>  
> -	if (!values_close(c[1], MB(21), 20))
> +	if (!values_close(c[1], MB(21), 20 + pscale_factor))
>  		goto cleanup;
>  
>  	if (c[3] != 0)
> @@ -596,7 +598,8 @@ static int test_memcg_protection(const char *root, bool min)
>  	}
>  
>  	current = min ? MB(50) : MB(30);
> -	if (!values_close(cg_read_long(parent[1], "memory.current"), current, 3))
> +	if (!values_close(cg_read_long(parent[1], "memory.current"), current,
> +				       9 + (min ? 0 : 6) * pscale_factor))
>  		goto cleanup;
>  
>  	if (!reclaim_until(children[0], MB(10)))
> @@ -684,7 +687,7 @@ static int alloc_pagecache_max_30M(const char *cgroup, void *arg)
>  		goto cleanup;
>  
>  	current = cg_read_long(cgroup, "memory.current");
> -	if (!values_close(current, MB(30), 5))
> +	if (!values_close(current, MB(30), 5 + (pscale_factor ? 2 : 0)))
>  		goto cleanup;
>  
>  	ret = 0;
> @@ -1004,7 +1007,7 @@ static int alloc_anon_50M_check_swap(const char *cgroup, void *arg)
>  		*ptr = 0;
>  
>  	mem_current = cg_read_long(cgroup, "memory.current");
> -	if (!mem_current || !values_close(mem_current, mem_max, 3))
> +	if (!mem_current || !values_close(mem_current, mem_max, 6 + pscale_factor))
>  		goto cleanup;
>  
>  	swap_current = cg_read_long(cgroup, "memory.swap.current");
> @@ -1684,6 +1687,14 @@ int main(int argc, char **argv)
>  	if (page_size <= 0)
>  		page_size = PAGE_SIZE;
>  
> +	/*
> +	 * It is found that the actual memory.current value can deviate more
> +	 * from the expected value with larger page size. So error tolerance
> +	 * will have to be increased a bit more for larger page size.
> +	 */
> +	if (page_size > KB(4))
> +		pscale_factor = (page_size >= KB(64)) ? 2 : 1;

This is a good improvement, but I still think the pscale_factor adjustments
are a bit fragile: each call site needs its own hand-tuned formula, and only
three page sizes (4K/16K/64K) are handled. If a new page size shows up,
every call site needs revisiting.

How about centralizing the page size adjustment inside values_close()
itself? Something like:

    static inline int values_close(long a, long b, int err)
    {
          ssize_t page_adjusted_err = ffs(page_size >> 13) + err;
    
          return 100 * labs(a - b) <= (a + b) * page_adjusted_err;
    }

This adds one extra percent of tolerance per doubling above 4K, scales
continuously for any power-of-two page size, and also fixes an integer
truncation issue in the original: (a + b) / 100 * err loses precision
when (a + b) < 100.

With this, the callers wouldn't need any changes at all.

This method is inspired from LTP:
  https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/controllers/memcg/memcontrol_common.h#L27

-- 
Regards,
Li Wang



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/7] selftests: memcg: Reduce the expected swap.peak with larger page size
  2026-03-20 20:42 ` [PATCH v2 5/7] selftests: memcg: Reduce the expected swap.peak with larger " Waiman Long
@ 2026-03-23  8:24   ` Li Wang
  2026-03-25  3:47     ` Waiman Long
  0 siblings, 1 reply; 24+ messages in thread
From: Li Wang @ 2026-03-23  8:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
	linux-kselftest, Sean Christopherson, James Houghton,
	Sebastian Chlad, Guopeng Zhang, Li Wang

On Fri, Mar 20, 2026 at 04:42:39PM -0400, Waiman Long wrote:
> When running the test_memcg_swap_max_peak test which sets swap.max
> to 30M on an arm64 system with 64k page size, the test failed as the
> swap.peak could only reach up only to 27,328,512 bytes (about 25.45
> MB which is lower than the expected 29M) before the allocating task
> got oom-killed.
> 
> It is likely due to the fact that it takes longer to write out a larger
> page to swap and hence a lower swap.peak is being reached. Setting
> memory.high to 29M to throttle memory allocation when nearing memory.max
> helps, but it still could only reach up to 29,032,448 bytes (about
> 27.04M). As a result, we have to reduce the expected swap.peak with
> larger page size. Now swap.peak is expected to reach only 27M with 64k
> page, 29M with 4k page and 28M with 16k page.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  .../selftests/cgroup/test_memcontrol.c        | 26 ++++++++++++++++---
>  1 file changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
> index c078fc458def..3832ded1e47b 100644
> --- a/tools/testing/selftests/cgroup/test_memcontrol.c
> +++ b/tools/testing/selftests/cgroup/test_memcontrol.c
> @@ -1032,6 +1032,7 @@ static int test_memcg_swap_max_peak(const char *root)
>  	char *memcg;
>  	long max, peak;
>  	struct stat ss;
> +	long swap_peak;
>  	int swap_peak_fd = -1, mem_peak_fd = -1;
>  
>  	/* any non-empty string resets */
> @@ -1119,6 +1120,23 @@ static int test_memcg_swap_max_peak(const char *root)
>  	if (cg_write(memcg, "memory.max", "30M"))
>  		goto cleanup;
>  
> +	/*
> +	 * The swap.peak that can be reached will depend on the system page
> +	 * size. With larger page size (e.g. 64k), it takes more time to write
> +	 * the anonymous memory page to swap and so the peak reached will be
> +	 * lower before the memory allocation process gets oom-killed. One way
> +	 * to allow the swap.peak to go higher is to throttle memory allocation
> +	 * by setting memory.high to, say, 29M to give more time to swap out the
> +	 * memory before oom-kill. This is still not enough for it to reach the
> +	 * 29M reachable with a 4k page. So we still need to reduce the expected
> +	 * swap.peak accordingly.
> +	 */
> +	swap_peak = (page_size == KB(4)) ? MB(29) :
> +		   ((page_size <= KB(16)) ? MB(28) : MB(27));

Or, go with a dynamic adjustment based on page size?

    swap_peak = MB(29) - ilog2(page_size / KB(4)) * MB(1);

-- 
Regards,
Li Wang



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 6/7] selftests: memcg: Don't call reclaim_until() if already in target
  2026-03-20 20:42 ` [PATCH v2 6/7] selftests: memcg: Don't call reclaim_until() if already in target Waiman Long
@ 2026-03-23  8:53   ` Li Wang
  0 siblings, 0 replies; 24+ messages in thread
From: Li Wang @ 2026-03-23  8:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
	linux-kselftest, Sean Christopherson, James Houghton,
	Sebastian Chlad, Guopeng Zhang, Li Wang

On Fri, Mar 20, 2026 at 04:42:40PM -0400, Waiman Long wrote:
> Near the end of test_memcg_protection(), reclaim_until() is called
> to reduce memory.current of children[0] to 10M. It was found that
> with larger page size (e.g. 64k) the various memory cgroups in
> test_memcg_protection() would deviate further from the expected values
> especially for the test_memcg_low test. As a result, children[0] might
> have reached the target already without reclamation. The will cause the
> reclaim_until() function to report failure as no reclamation is needed.
> 
> Avoid this unexpected failure by skipping the reclaim_until() call if
> memory.current of children[0] has already reached the target size for
> kernel with non-4k page size.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Reviewed-by: Li Wang <liwang@redhat.com>

-- 
Regards,
Li Wang



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL
  2026-03-20 20:42 ` [PATCH v2 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL Waiman Long
@ 2026-03-23  9:44   ` Li Wang
  0 siblings, 0 replies; 24+ messages in thread
From: Li Wang @ 2026-03-23  9:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
	linux-kselftest, Sean Christopherson, James Houghton,
	Sebastian Chlad, Guopeng Zhang, Li Wang

On Fri, Mar 20, 2026 at 04:42:41PM -0400, Waiman Long wrote:
> Although there is supposed to be a periodic and asynchronous flush of
> stats every 2 seconds, the actual time lag between successive runs
> can actually vary quite a bit. In fact, I have seen time lags of tens
> of seconds in some cases.
> 
> At the end of test_memcg_sock, it waits up to 3 seconds for the
> "sock" attribute of memory.stat to go back down to 0. Obviously it
> may occasionally fail especially when the kernel has large page size
> (e.g. 64k). Treat this failure as an expected failure (XFAIL) to
> distinguish it from the other failure cases.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  tools/testing/selftests/cgroup/test_memcontrol.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
> index 5336be5ed2f5..af3e8fe4e50e 100644
> --- a/tools/testing/selftests/cgroup/test_memcontrol.c
> +++ b/tools/testing/selftests/cgroup/test_memcontrol.c
> @@ -1486,12 +1486,21 @@ static int test_memcg_sock(const char *root)
>  	 * Poll memory.stat for up to 3 seconds (~FLUSH_TIME plus some
>  	 * scheduling slack) and require that the "sock " counter
>  	 * eventually drops to zero.
> +	 *
> +	 * The actual elapsed time between consecutive runs
> +	 * of the asynchronous memcg rstat flush may vary quite a bit.
> +	 * So the 3 seconds wait time may not be enough for the "sock"
> +	 * counter to go down to 0. Treat it as an XFAIL instead of
> +	 * a FAIL.
>  	 */
>  	sock_post = cg_read_key_long_poll(memcg, "memory.stat", "sock ", 0,
>  					 MEMCG_SOCKSTAT_WAIT_RETRIES,
>  					 DEFAULT_WAIT_INTERVAL_US);
> -	if (sock_post)
> +	if (sock_post) {
> +		if (sock_post > 0)
> +			ret = KSFT_XFAIL;

XFAIL means "expected failure" and is intended for known kernel bugs or
unsupported features. A timing issue where the test simply doesn't wait
long enough probably not an expected failure, it's a test that needs a
longer timeout.

I'm wondering can we just enlarge the MEMCG_SOCKSTAT_WAIT_RETRIES value?
e.g. from 30 to 150


-- 
Regards,
Li Wang



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2)
  2026-03-20 20:42 ` [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2) Waiman Long
@ 2026-03-23 12:46   ` Li Wang
  2026-03-24  0:15     ` Yosry Ahmed
  0 siblings, 1 reply; 24+ messages in thread
From: Li Wang @ 2026-03-23 12:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
	linux-kselftest, Sean Christopherson, James Houghton,
	Sebastian Chlad, Guopeng Zhang, Li Wang

On Fri, Mar 20, 2026 at 04:42:35PM -0400, Waiman Long wrote:
> The vmstats flush threshold currently increases linearly with the
> number of online CPUs. As the number of CPUs increases over time, it
> will become increasingly difficult to meet the threshold and update the
> vmstats data in a timely manner. These days, systems with hundreds of
> CPUs or even thousands of them are becoming more common.
> 
> For example, the test_memcg_sock test of test_memcontrol always fails
> when running on an arm64 system with 128 CPUs. It is because the
> threshold is now 64*128 = 8192. With 4k page size, it needs changes in
> 32 MB of memory. It will be even worse with larger page size like 64k.
> 
> To make the output of memory.stat more correct, it is better to scale
> up the threshold slower than linearly with the number of CPUs. The
> int_sqrt() function is a good compromise as suggested by Li Wang [1].
> An extra 2 is added to make sure that we will double the threshold for
> a 2-core system. The increase will be slower after that.
> 
> With the int_sqrt() scale, we can use the possibly larger
> num_possible_cpus() instead of num_online_cpus() which may change at
> run time.
> 
> Although there is supposed to be a periodic and asynchronous flush of
> vmstats every 2 seconds, the actual time lag between successive runs
> can actually vary quite a bit. In fact, I have seen time lags of tens
> of seconds in some cases. So we can't rely too much on the hope
> that there will be an asynchronous vmstats flush every 2 seconds. This
> may be something we need to look into.
> 
> [1] https://lore.kernel.org/lkml/ab0kAE7mJkEL9kWb@redhat.com/
> 
> Suggested-by: Li Wang <liwang@redhat.com>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/memcontrol.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 772bac21d155..cc1fc0f5aeea 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -548,20 +548,20 @@ struct memcg_vmstats {
>   *    rstat update tree grow unbounded.
>   *
>   * 2) Flush the stats synchronously on reader side only when there are more than
> - *    (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
> - *    will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
> - *    only for 2 seconds due to (1).
> + *    (MEMCG_CHARGE_BATCH * int_sqrt(nr_cpus+2)) update events. Though this
> + *    optimization will let stats be out of sync by up to that amount. This is
> + *    supposed to last for up to 2 seconds due to (1).
>   */
>  static void flush_memcg_stats_dwork(struct work_struct *w);
>  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>  static u64 flush_last_time;
> +static int vmstats_flush_threshold __ro_after_init;
>  
>  #define FLUSH_TIME (2UL*HZ)
>  
>  static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
>  {
> -	return atomic_read(&vmstats->stats_updates) >
> -		MEMCG_CHARGE_BATCH * num_online_cpus();
> +	return atomic_read(&vmstats->stats_updates) > vmstats_flush_threshold;
>  }
>  
>  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
> @@ -5191,6 +5191,14 @@ int __init mem_cgroup_init(void)
>  
>  	memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
>  				     SLAB_PANIC | SLAB_HWCACHE_ALIGN);
> +	/*
> +	 * Scale up vmstats flush threshold with int_sqrt(nr_cpus+2). The extra
> +	 * 2 constant is to make sure that the threshold is double for a 2-core
> +	 * system. After that, it will increase by MEMCG_CHARGE_BATCH when the
> +	 * number of CPUs reaches the next (2^n - 2) value.
> +	 */
> +	vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
> +				  (int_sqrt(num_possible_cpus() + 2));
>  
>  	return 0;
>  }

Reviewed-by: Li Wang <liwang@redhat.com>

-- 
Regards,
Li Wang



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE
  2026-03-20 20:42 ` [PATCH v2 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE Waiman Long
@ 2026-03-23 12:47   ` Li Wang
  2026-03-24  0:17     ` Yosry Ahmed
  0 siblings, 1 reply; 24+ messages in thread
From: Li Wang @ 2026-03-23 12:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
	linux-kselftest, Sean Christopherson, James Houghton,
	Sebastian Chlad, Guopeng Zhang, Li Wang

On Fri, Mar 20, 2026 at 04:42:36PM -0400, Waiman Long wrote:
> For a system with 4k page size, each percpu memcg_stock can hide up
> to 256 kbytes of memory with the current MEMCG_CHARGE_BATCH value of
> 64. For another system with 64k page size, that becomes 4 Mbytes. These
> hidden charges will affect the accuracy of the memory.current value.
> 
> This MEMCG_CHARGE_BATCH value also controls how often the
> memcg vmstat values need flushing. As a result, the values reported
> in memory.stat cgroup control files are less indicative of the actual
> memory consumption of a particular memory cgroup when the page size
> increases from 4k.
> 
> This problem can be illustrated by running the test_memcontrol
> selftest. Running a 4k page size kernel on a 128-core arm64 system,
> the test_memcg_current_peak test, which allocates 50M of anonymous memory,
> passed. With a 64k page size kernel on the same system, however, the
> same test failed because the "anon" attribute of the memory.stat file might
> report a size of 0 depending on the number of CPUs the system has.
> 
> To solve this inaccurate memory stats problem, we need to scale down
> the amount of memory that can be hidden by reducing MEMCG_CHARGE_BATCH
> when the page size increases. The same user application will likely
> consume more memory on systems with larger page size and it is also
> less efficient if we scale down MEMCG_CHARGE_BATCH by too much.  So I
> believe a good compromise is to scale down MEMCG_CHARGE_BATCH by 2 for
> 16k page size and by 4 with 64k page size.
> 
> With that change, the test_memcg_current_peak test passed again with
> the modified 64k page size kernel.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  include/linux/memcontrol.h | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 70b685a85bf4..748cfd75d998 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -328,8 +328,14 @@ struct mem_cgroup {
>   * size of first charge trial.
>   * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
>   * workload.
> + *
> + * There are 3 common base page sizes - 4k, 16k & 64k. In order to limit the
> + * amount of memory that can be hidden in each percpu memcg_stock for a given
> + * memcg, we scale down MEMCG_CHARGE_BATCH by 2 for 16k and 4 for 64k.
>   */
> -#define MEMCG_CHARGE_BATCH 64U
> +#define MEMCG_CHARGE_BATCH_BASE  64U
> +#define MEMCG_CHARGE_BATCH_SHIFT ((PAGE_SHIFT <= 16) ? (PAGE_SHIFT - 12)/2 : 2)
> +#define MEMCG_CHARGE_BATCH	 (MEMCG_CHARGE_BATCH_BASE >> MEMCG_CHARGE_BATCH_SHIFT)
>  
>  extern struct mem_cgroup *root_mem_cgroup;

Reviewed-by: Li Wang <liwang@redhat.com>

-- 
Regards,
Li Wang



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2)
  2026-03-23 12:46   ` Li Wang
@ 2026-03-24  0:15     ` Yosry Ahmed
  2026-03-25 16:47       ` Waiman Long
  0 siblings, 1 reply; 24+ messages in thread
From: Yosry Ahmed @ 2026-03-24  0:15 UTC (permalink / raw)
  To: Li Wang
  Cc: Waiman Long, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Tejun Heo,
	Michal Koutný, Shuah Khan, Mike Rapoport, linux-kernel,
	cgroups, linux-mm, linux-kselftest, Sean Christopherson,
	James Houghton, Sebastian Chlad, Guopeng Zhang, Li Wang

On Mon, Mar 23, 2026 at 5:46 AM Li Wang <liwang@redhat.com> wrote:
>
> On Fri, Mar 20, 2026 at 04:42:35PM -0400, Waiman Long wrote:
> > The vmstats flush threshold currently increases linearly with the
> > number of online CPUs. As the number of CPUs increases over time, it
> > will become increasingly difficult to meet the threshold and update the
> > vmstats data in a timely manner. These days, systems with hundreds of
> > CPUs or even thousands of them are becoming more common.
> >
> > For example, the test_memcg_sock test of test_memcontrol always fails
> > when running on an arm64 system with 128 CPUs. It is because the
> > threshold is now 64*128 = 8192. With 4k page size, it needs changes in
> > 32 MB of memory. It will be even worse with larger page size like 64k.
> >
> > To make the output of memory.stat more correct, it is better to scale
> > up the threshold slower than linearly with the number of CPUs. The
> > int_sqrt() function is a good compromise as suggested by Li Wang [1].
> > An extra 2 is added to make sure that we will double the threshold for
> > a 2-core system. The increase will be slower after that.
> >
> > With the int_sqrt() scale, we can use the possibly larger
> > num_possible_cpus() instead of num_online_cpus() which may change at
> > run time.
> >
> > Although there is supposed to be a periodic and asynchronous flush of
> > vmstats every 2 seconds, the actual time lag between successive runs
> > can actually vary quite a bit. In fact, I have seen time lags of tens
> > of seconds in some cases. So we can't rely too much on the hope
> > that there will be an asynchronous vmstats flush every 2 seconds. This
> > may be something we need to look into.
> >
> > [1] https://lore.kernel.org/lkml/ab0kAE7mJkEL9kWb@redhat.com/
> >
> > Suggested-by: Li Wang <liwang@redhat.com>
> > Signed-off-by: Waiman Long <longman@redhat.com>

What's the motivation for this fix? Is it purely to make tests more
reliable on systems with larger page sizes?

We need some performance tests to make sure we're not flushing too
eagerly with the sqrt scale imo. We need to make sure that when we
have a lot of cgroups and a lot of flushers we don't end up performing
worse.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE
  2026-03-23 12:47   ` Li Wang
@ 2026-03-24  0:17     ` Yosry Ahmed
  0 siblings, 0 replies; 24+ messages in thread
From: Yosry Ahmed @ 2026-03-24  0:17 UTC (permalink / raw)
  To: Li Wang
  Cc: Waiman Long, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Tejun Heo,
	Michal Koutný, Shuah Khan, Mike Rapoport, linux-kernel,
	cgroups, linux-mm, linux-kselftest, Sean Christopherson,
	James Houghton, Sebastian Chlad, Guopeng Zhang, Li Wang

On Mon, Mar 23, 2026 at 5:47 AM Li Wang <liwang@redhat.com> wrote:
>
> On Fri, Mar 20, 2026 at 04:42:36PM -0400, Waiman Long wrote:
> > For a system with 4k page size, each percpu memcg_stock can hide up
> > to 256 kbytes of memory with the current MEMCG_CHARGE_BATCH value of
> > 64. For another system with 64k page size, that becomes 4 Mbytes. These
> > hidden charges will affect the accuracy of the memory.current value.
> >
> > This MEMCG_CHARGE_BATCH value also controls how often the
> > memcg vmstat values need flushing. As a result, the values reported
> > in memory.stat cgroup control files are less indicative of the actual
> > memory consumption of a particular memory cgroup when the page size
> > increases from 4k.
> >
> > This problem can be illustrated by running the test_memcontrol
> > selftest. Running a 4k page size kernel on a 128-core arm64 system,
> > the test_memcg_current_peak test which allocates a 50M anonymous memory
> > passed. With a 64k page size kernel on the same system, however, the
> > same test failed because the "anon" attribute of memory.stat file might
> > report a size of 0 depending on the number of CPUs the system has.
> >
> > To solve this inaccurate memory stats problem, we need to scale down
> > the amount of memory that can be hidden by reducing MEMCG_CHARGE_BATCH
> > when the page size increases. The same user application will likely
> > consume more memory on systems with a larger page size, and scaling
> > down MEMCG_CHARGE_BATCH by too much also hurts efficiency. So I
> > believe a good compromise is to scale down MEMCG_CHARGE_BATCH by 2 for
> > 16k page size and by 4 with 64k page size.
> >
> > With that change, the test_memcg_current_peak test passed again with
> > the modified 64k page size kernel.
> >
> > Signed-off-by: Waiman Long <longman@redhat.com>

We need performance testing for this too.



* Re: [PATCH v2 3/7] selftests: memcg: Iterate pages based on the actual page size
  2026-03-23  2:53   ` Li Wang
  2026-03-23  2:56     ` Li Wang
@ 2026-03-25  3:33     ` Waiman Long
  1 sibling, 0 replies; 24+ messages in thread
From: Waiman Long @ 2026-03-25  3:33 UTC (permalink / raw)
  To: Li Wang
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
	linux-kselftest, Sean Christopherson, James Houghton,
	Sebastian Chlad, Guopeng Zhang


On 3/22/26 10:53 PM, Li Wang wrote:
> Hi Waiman,
>
> I currently have another patch in hand that is functionally identical to
> patch 3/7 here (requested by Sashiko).
>
> May I go ahead and merge it directly into my patch series, while
> retaining your authorship attribution?

Sure. You can merge this patch into your series. I just need a page_size 
global variable to show the system page size for the rest of my 
memcontrol.c changes. I will then base my series on top of yours since 
it is further along.

Cheers,
Longman




* Re: [PATCH v2 5/7] selftests: memcg: Reduce the expected swap.peak with larger page size
  2026-03-23  8:24   ` Li Wang
@ 2026-03-25  3:47     ` Waiman Long
  0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2026-03-25  3:47 UTC (permalink / raw)
  To: Li Wang
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
	linux-kselftest, Sean Christopherson, James Houghton,
	Sebastian Chlad, Guopeng Zhang, Li Wang

On 3/23/26 4:24 AM, Li Wang wrote:
> On Fri, Mar 20, 2026 at 04:42:39PM -0400, Waiman Long wrote:
>> When running the test_memcg_swap_max_peak test which sets swap.max
>> to 30M on an arm64 system with 64k page size, the test failed as the
>> swap.peak could only reach up to 27,328,512 bytes (about 26.06
>> MB, which is lower than the expected 29M) before the allocating task
>> got oom-killed.
>>
>> It is likely due to the fact that it takes longer to write out a larger
>> page to swap and hence a lower swap.peak is being reached. Setting
>> memory.high to 29M to throttle memory allocation when nearing memory.max
helps, but it still could only reach up to 29,032,448 bytes (about
>> 27.69M). As a result, we have to reduce the expected swap.peak with
>> larger page size. Now swap.peak is expected to reach only 27M with 64k
>> page, 29M with 4k page and 28M with 16k page.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   .../selftests/cgroup/test_memcontrol.c        | 26 ++++++++++++++++---
>>   1 file changed, 22 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
>> index c078fc458def..3832ded1e47b 100644
>> --- a/tools/testing/selftests/cgroup/test_memcontrol.c
>> +++ b/tools/testing/selftests/cgroup/test_memcontrol.c
>> @@ -1032,6 +1032,7 @@ static int test_memcg_swap_max_peak(const char *root)
>>   	char *memcg;
>>   	long max, peak;
>>   	struct stat ss;
>> +	long swap_peak;
>>   	int swap_peak_fd = -1, mem_peak_fd = -1;
>>   
>>   	/* any non-empty string resets */
>> @@ -1119,6 +1120,23 @@ static int test_memcg_swap_max_peak(const char *root)
>>   	if (cg_write(memcg, "memory.max", "30M"))
>>   		goto cleanup;
>>   
>> +	/*
>> +	 * The swap.peak that can be reached will depend on the system page
>> +	 * size. With larger page size (e.g. 64k), it takes more time to write
>> +	 * the anonymous memory page to swap and so the peak reached will be
>> +	 * lower before the memory allocation process gets oom-killed. One way
>> +	 * to allow the swap.peak to go higher is to throttle memory allocation
>> +	 * by setting memory.high to, say, 29M to give more time to swap out the
>> +	 * memory before oom-kill. This is still not enough to reach the 29M
>> +	 * attainable with a 4k page. So we still need to reduce the expected
>> +	 * swap.peak accordingly.
>> +	 */
>> +	swap_peak = (page_size == KB(4)) ? MB(29) :
>> +		   ((page_size <= KB(16)) ? MB(28) : MB(27));
> Or, go with a dynamic adjustment based on page size?
>
>      swap_peak = MB(29) - ilog2(page_size / KB(4)) * MB(1);
>
That is a good suggestion. I will adopt a dynamic adjustment based on 
the page size.

Cheers,
Longman




* Re: [PATCH v2 4/7] selftests: memcg: Increase error tolerance in accordance with page size
  2026-03-23  8:01   ` Li Wang
@ 2026-03-25 16:42     ` Waiman Long
  0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2026-03-25 16:42 UTC (permalink / raw)
  To: Li Wang
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
	linux-kselftest, Sean Christopherson, James Houghton,
	Sebastian Chlad, Guopeng Zhang, Li Wang

On 3/23/26 4:01 AM, Li Wang wrote:
> On Fri, Mar 20, 2026 at 04:42:38PM -0400, Waiman Long wrote:
>> It was found that some of the tests in test_memcontrol can fail more
>> readily if system page size is larger than 4k. It is because the
>> actual memory.current value deviates more from the expected value with
>> larger page size. This is likely due to the fact that there may be up to
>> MEMCG_CHARGE_BATCH pages of charge hidden in each one of the percpu
>> memcg_stock.
>>
>> To avoid this failure, the error tolerance is now increased in accordance
>> with the current system page size value. The page size scale factor is
>> set to 2 for 64k page and 1 for 16k page.
>>
>> Changes are made in alloc_pagecache_max_30M(), test_memcg_protection()
>> and alloc_anon_50M_check_swap() to increase the error tolerance for
>> memory.current for larger page size. The current set of values are
>> chosen to ensure that the relevant test_memcontrol tests no longer
>> have any test failure in 100 repeated runs of test_memcontrol with
>> 4k/16k/64k page size kernels on an arm64 system.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   .../cgroup/lib/include/cgroup_util.h          |  3 ++-
>>   .../selftests/cgroup/test_memcontrol.c        | 23 ++++++++++++++-----
>>   2 files changed, 19 insertions(+), 7 deletions(-)
>>
>> diff --git a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
>> index 77f386dab5e8..2293e770e9b4 100644
>> --- a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
>> +++ b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
>> @@ -6,7 +6,8 @@
>>   #define PAGE_SIZE 4096
>>   #endif
>>   
>> -#define MB(x) (x << 20)
>> +#define KB(x) ((x) << 10)
>> +#define MB(x) ((x) << 20)
>>   
>>   #define USEC_PER_SEC	1000000L
>>   #define NSEC_PER_SEC	1000000000L
>> diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
>> index babbfad10aaf..c078fc458def 100644
>> --- a/tools/testing/selftests/cgroup/test_memcontrol.c
>> +++ b/tools/testing/selftests/cgroup/test_memcontrol.c
>> @@ -26,6 +26,7 @@
>>   static bool has_localevents;
>>   static bool has_recursiveprot;
>>   static int page_size;
>> +static int pscale_factor;	/* Page size scale factor */
>>   
>>   int get_temp_fd(void)
>>   {
>> @@ -571,16 +572,17 @@ static int test_memcg_protection(const char *root, bool min)
>>   	if (cg_run(parent[2], alloc_anon, (void *)MB(148)))
>>   		goto cleanup;
>>   
>> -	if (!values_close(cg_read_long(parent[1], "memory.current"), MB(50), 3))
>> +	if (!values_close(cg_read_long(parent[1], "memory.current"), MB(50),
>> +				       3 + (min ? 0 : 4) * pscale_factor))
>>   		goto cleanup;
>>   
>>   	for (i = 0; i < ARRAY_SIZE(children); i++)
>>   		c[i] = cg_read_long(children[i], "memory.current");
>>   
>> -	if (!values_close(c[0], MB(29), 15))
>> +	if (!values_close(c[0], MB(29), 15 + 3 * pscale_factor))
>>   		goto cleanup;
>>   
>> -	if (!values_close(c[1], MB(21), 20))
>> +	if (!values_close(c[1], MB(21), 20 + pscale_factor))
>>   		goto cleanup;
>>   
>>   	if (c[3] != 0)
>> @@ -596,7 +598,8 @@ static int test_memcg_protection(const char *root, bool min)
>>   	}
>>   
>>   	current = min ? MB(50) : MB(30);
>> -	if (!values_close(cg_read_long(parent[1], "memory.current"), current, 3))
>> +	if (!values_close(cg_read_long(parent[1], "memory.current"), current,
>> +				       9 + (min ? 0 : 6) * pscale_factor))
>>   		goto cleanup;
>>   
>>   	if (!reclaim_until(children[0], MB(10)))
>> @@ -684,7 +687,7 @@ static int alloc_pagecache_max_30M(const char *cgroup, void *arg)
>>   		goto cleanup;
>>   
>>   	current = cg_read_long(cgroup, "memory.current");
>> -	if (!values_close(current, MB(30), 5))
>> +	if (!values_close(current, MB(30), 5 + (pscale_factor ? 2 : 0)))
>>   		goto cleanup;
>>   
>>   	ret = 0;
>> @@ -1004,7 +1007,7 @@ static int alloc_anon_50M_check_swap(const char *cgroup, void *arg)
>>   		*ptr = 0;
>>   
>>   	mem_current = cg_read_long(cgroup, "memory.current");
>> -	if (!mem_current || !values_close(mem_current, mem_max, 3))
>> +	if (!mem_current || !values_close(mem_current, mem_max, 6 + pscale_factor))
>>   		goto cleanup;
>>   
>>   	swap_current = cg_read_long(cgroup, "memory.swap.current");
>> @@ -1684,6 +1687,14 @@ int main(int argc, char **argv)
>>   	if (page_size <= 0)
>>   		page_size = PAGE_SIZE;
>>   
>> +	/*
>> +	 * It is found that the actual memory.current value can deviate more
>> +	 * from the expected value with larger page size. So error tolerance
>> +	 * will have to be increased a bit more for larger page size.
>> +	 */
>> +	if (page_size > KB(4))
>> +		pscale_factor = (page_size >= KB(64)) ? 2 : 1;
> This is a good improvement but I still think the pscale_factor adjustments
> are a bit fragile, each call site needs its own hand-tuned formula, and only
> three page sizes (4K/16K/64K) are handled. If a new page size shows up,
> every call site needs revisiting.
>
> How about centralizing the page size adjustment inside values_close()
> itself? Something like:
>
>      static inline int values_close(long a, long b, int err)
>      {
>            ssize_t page_adjusted_err = ffs(page_size >> 13) + err;
>      
>            return 100 * labs(a - b) <= (a + b) * page_adjusted_err;
>      }
>
> This adds one extra percent of tolerance per doubling above 4K, scales
> continuously for any power-of-two page size, and also fixes an integer
> truncation issue in the original: (a + b) / 100 * err loses precision
> when (a + b) < 100.
>
> With this, the callers wouldn't need any changes at all.
>
> This method is inspired from LTP:
>    https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/controllers/memcg/memcontrol_common.h#L27

Good point. I will implement something like that in the next version.

Cheers,
Longman




* Re: [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2)
  2026-03-24  0:15     ` Yosry Ahmed
@ 2026-03-25 16:47       ` Waiman Long
  2026-03-25 17:23         ` Yosry Ahmed
  0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2026-03-25 16:47 UTC (permalink / raw)
  To: Yosry Ahmed, Li Wang
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
	Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
	linux-kselftest, Sean Christopherson, James Houghton,
	Sebastian Chlad, Guopeng Zhang, Li Wang

On 3/23/26 8:15 PM, Yosry Ahmed wrote:
> On Mon, Mar 23, 2026 at 5:46 AM Li Wang <liwang@redhat.com> wrote:
>> On Fri, Mar 20, 2026 at 04:42:35PM -0400, Waiman Long wrote:
>>> The vmstats flush threshold currently increases linearly with the
>>> number of online CPUs. As the number of CPUs increases over time, it
>>> will become increasingly difficult to meet the threshold and update the
>>> vmstats data in a timely manner. These days, systems with hundreds of
>>> CPUs or even thousands of them are becoming more common.
>>>
>>> For example, the test_memcg_sock test of test_memcontrol always fails
>>> when running on an arm64 system with 128 CPUs. This is because the
>>> threshold is now 64*128 = 8192 pages. With a 4k page size, it takes
>>> changes to 32 MB of memory to trigger a flush. It will be even worse
>>> with a larger page size like 64k.
>>>
>>> To make the output of memory.stat more correct, it is better to scale
>>> up the threshold slower than linearly with the number of CPUs. The
>>> int_sqrt() function is a good compromise as suggested by Li Wang [1].
>>> An extra 2 is added to make sure that we will double the threshold for
>>> a 2-core system. The increase will be slower after that.
>>>
>>> With the int_sqrt() scale, we can use the possibly larger
>>> num_possible_cpus() instead of num_online_cpus() which may change at
>>> run time.
>>>
>>> Although there is supposed to be a periodic and asynchronous flush of
>>> vmstats every 2 seconds, the actual time lag between successive runs
>>> can actually vary quite a bit. In fact, I have seen time lags of up
>>> to 10s of seconds in some cases. So we cannot rely too much on the hope
>>> that there will be an asynchronous vmstats flush every 2 seconds. This
>>> may be something we need to look into.
>>>
>>> [1] https://lore.kernel.org/lkml/ab0kAE7mJkEL9kWb@redhat.com/
>>>
>>> Suggested-by: Li Wang <liwang@redhat.com>
>>> Signed-off-by: Waiman Long <longman@redhat.com>
> What's the motivation for this fix? Is it purely to make tests more
> reliable on systems with larger page sizes?
>
> We need some performance tests to make sure we're not flushing too
> eagerly with the sqrt scale imo. We need to make sure that when we
> have a lot of cgroups and a lot of flushers we don't end up performing
> worse.

I will include some performance data in the next version. Do you have 
any suggestions on which readily available tests I can use for this 
performance testing?

Cheers,
Longman




* Re: [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2)
  2026-03-25 16:47       ` Waiman Long
@ 2026-03-25 17:23         ` Yosry Ahmed
  0 siblings, 0 replies; 24+ messages in thread
From: Yosry Ahmed @ 2026-03-25 17:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Wang, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Tejun Heo,
	Michal Koutný, Shuah Khan, Mike Rapoport, linux-kernel,
	cgroups, linux-mm, linux-kselftest, Sean Christopherson,
	James Houghton, Sebastian Chlad, Guopeng Zhang, Li Wang

On Wed, Mar 25, 2026 at 9:47 AM Waiman Long <longman@redhat.com> wrote:
>
> On 3/23/26 8:15 PM, Yosry Ahmed wrote:
> > On Mon, Mar 23, 2026 at 5:46 AM Li Wang <liwang@redhat.com> wrote:
> >> On Fri, Mar 20, 2026 at 04:42:35PM -0400, Waiman Long wrote:
> >>> The vmstats flush threshold currently increases linearly with the
> >>> number of online CPUs. As the number of CPUs increases over time, it
> >>> will become increasingly difficult to meet the threshold and update the
> >>> vmstats data in a timely manner. These days, systems with hundreds of
> >>> CPUs or even thousands of them are becoming more common.
> >>>
> >>> For example, the test_memcg_sock test of test_memcontrol always fails
> >>> when running on an arm64 system with 128 CPUs. This is because the
> >>> threshold is now 64*128 = 8192 pages. With a 4k page size, it takes
> >>> changes to 32 MB of memory to trigger a flush. It will be even worse
> >>> with a larger page size like 64k.
> >>>
> >>> To make the output of memory.stat more correct, it is better to scale
> >>> up the threshold slower than linearly with the number of CPUs. The
> >>> int_sqrt() function is a good compromise as suggested by Li Wang [1].
> >>> An extra 2 is added to make sure that we will double the threshold for
> >>> a 2-core system. The increase will be slower after that.
> >>>
> >>> With the int_sqrt() scale, we can use the possibly larger
> >>> num_possible_cpus() instead of num_online_cpus() which may change at
> >>> run time.
> >>>
> >>> Although there is supposed to be a periodic and asynchronous flush of
> >>> vmstats every 2 seconds, the actual time lag between successive runs
> >>> can actually vary quite a bit. In fact, I have seen time lags of up
> >>> to 10s of seconds in some cases. So we cannot rely too much on the hope
> >>> that there will be an asynchronous vmstats flush every 2 seconds. This
> >>> may be something we need to look into.
> >>>
> >>> [1] https://lore.kernel.org/lkml/ab0kAE7mJkEL9kWb@redhat.com/
> >>>
> >>> Suggested-by: Li Wang <liwang@redhat.com>
> >>> Signed-off-by: Waiman Long <longman@redhat.com>
> > What's the motivation for this fix? Is it purely to make tests more
> > reliable on systems with larger page sizes?
> >
> > We need some performance tests to make sure we're not flushing too
> > eagerly with the sqrt scale imo. We need to make sure that when we
> > have a lot of cgroups and a lot of flushers we don't end up performing
> > worse.
>
> I will include some performance data in the next version. Do you have
> any suggestion of which readily available tests that I can use for this
> performance testing purpose.

I am not sure what readily available tests can stress this. In the
past, I wrote a synthetic workload that spawns a lot of readers of
memory.stat from userspace as well as reclaimers to trigger flushing
from both the kernel and userspace, with a large number of cgroups. I
don't have that lying around unfortunately.



end of thread, other threads:[~2026-03-25 17:23 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-20 20:42 [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
2026-03-20 20:42 ` [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2) Waiman Long
2026-03-23 12:46   ` Li Wang
2026-03-24  0:15     ` Yosry Ahmed
2026-03-25 16:47       ` Waiman Long
2026-03-25 17:23         ` Yosry Ahmed
2026-03-20 20:42 ` [PATCH v2 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE Waiman Long
2026-03-23 12:47   ` Li Wang
2026-03-24  0:17     ` Yosry Ahmed
2026-03-20 20:42 ` [PATCH v2 3/7] selftests: memcg: Iterate pages based on the actual page size Waiman Long
2026-03-23  2:53   ` Li Wang
2026-03-23  2:56     ` Li Wang
2026-03-25  3:33     ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 4/7] selftests: memcg: Increase error tolerance in accordance with " Waiman Long
2026-03-23  8:01   ` Li Wang
2026-03-25 16:42     ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 5/7] selftests: memcg: Reduce the expected swap.peak with larger " Waiman Long
2026-03-23  8:24   ` Li Wang
2026-03-25  3:47     ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 6/7] selftests: memcg: Don't call reclaim_until() if already in target Waiman Long
2026-03-23  8:53   ` Li Wang
2026-03-20 20:42 ` [PATCH v2 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL Waiman Long
2026-03-23  9:44   ` Li Wang
2026-03-21  1:16 ` [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox