* [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes
@ 2026-03-19 17:37 Waiman Long
2026-03-19 17:37 ` [PATCH 1/7] memcg: Scale up vmstats flush threshold with log2(num_possible_cpus) Waiman Long
` (7 more replies)
0 siblings, 8 replies; 16+ messages in thread
From: Waiman Long @ 2026-03-19 17:37 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
Shuah Khan, Mike Rapoport
Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
Sean Christopherson, James Houghton, Sebastian Chlad,
Guopeng Zhang, Li Wang, Waiman Long
There are a number of test failures when running the test_memcontrol
selftest on a 128-core arm64 system on kernels with 4k/16k/64k page
sizes. This patch series makes some minor changes to the kernel and
the test_memcontrol selftest to address these failures.
The first kernel patch scales the memcg vmstats flush threshold
logarithmically instead of linearly with the total number of CPUs. The
second kernel patch scales down MEMCG_CHARGE_BATCH as the page size
increases. These 2 patches help to reduce the discrepancies between the
reported usage data and the real ones.
The next 5 test_memcontrol selftest patches adjust the testing code to
greatly reduce the chance that it will report failure, though some
occasional failures are still possible.
To verify the changes, the test_memcontrol selftest was run 100
times each on a 128-core arm64 system on kernels with 4k/16k/64k
page sizes. No failure was observed other than some failures of the
test_memcg_reclaim test when running on a 16k page size kernel. The
reclaim_until() call failed because of an unexpected over-reclaim of
memory. This will need a further look, but it happens only with the 16k
page size kernel and I don't have a production-ready kernel config file
to use in building this 16k page size kernel. The new test_memcontrol
selftest and kernel were also run on a 96-core x86 system to make sure
there was no regression.
Waiman Long (7):
memcg: Scale up vmstats flush threshold with log2(num_possible_cpus)
memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE
selftests: memcg: Iterate pages based on the actual page size
selftests: memcg: Increase error tolerance in accordance with page
size
selftests: memcg: Reduce the expected swap.peak with larger page size
selftests: memcg: Don't call reclaim_until() if already in target
selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as
XFAIL
include/linux/memcontrol.h | 8 +-
mm/memcontrol.c | 17 ++--
.../cgroup/lib/include/cgroup_util.h | 1 +
.../selftests/cgroup/test_memcontrol.c | 83 +++++++++++++++----
4 files changed, 87 insertions(+), 22 deletions(-)
--
2.53.0
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/7] memcg: Scale up vmstats flush threshold with log2(num_possible_cpus)
2026-03-19 17:37 [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
@ 2026-03-19 17:37 ` Waiman Long
2026-03-20 10:40 ` Li Wang
2026-03-19 17:37 ` [PATCH 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE Waiman Long
` (6 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2026-03-19 17:37 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
Shuah Khan, Mike Rapoport
Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
Sean Christopherson, James Houghton, Sebastian Chlad,
Guopeng Zhang, Li Wang, Waiman Long
The vmstats flush threshold currently increases linearly with the
number of online CPUs. As the number of CPUs increases over time, it
will become increasingly difficult to meet the threshold and update the
vmstats data in a timely manner. These days, systems with hundreds
or even thousands of CPUs are becoming more common.
For example, the test_memcg_sock test of test_memcontrol always fails
when running on an arm64 system with 128 CPUs. This is because the
threshold is now 64*128 = 8192 update events. With a 4k page size,
that means changes in 32 MB of memory are needed before a synchronous
flush happens. It will be even worse with a larger page size like 64k.
To make the output of memory.stat more accurate, it is better to
scale up the threshold logarithmically instead of linearly with the
number of CPUs. With the log2 scale, we can use the possibly larger
num_possible_cpus() instead of num_online_cpus(), which may change at
run time.
Although there is supposed to be a periodic and asynchronous flush of
vmstats every 2 seconds, the actual time lag between successive runs
can vary quite a bit. In fact, I have seen time lags of up to tens of
seconds in some cases. So we cannot rely too much on the assumption
that there will be an asynchronous vmstats flush every 2 seconds. This
may be something we need to look into.
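The effect of the change can be sketched in user space (this is an approximation, not kernel code; ilog2_sketch() is a hypothetical stand-in for the kernel's ilog2()):

```c
/* User-space sketch of the kernel's ilog2(): floor(log2(n)); ilog2(1) == 0. */
static int ilog2_sketch(unsigned int n)
{
	int log = -1;

	while (n) {
		n >>= 1;
		log++;
	}
	return log;
}

/* Old scheme: flush threshold grows linearly with the CPU count. */
static unsigned int threshold_linear(unsigned int batch, unsigned int cpus)
{
	return batch * cpus;
}

/* New scheme: flush threshold grows with log2 of the CPU count. */
static unsigned int threshold_log2(unsigned int batch, unsigned int cpus)
{
	return batch * (ilog2_sketch(cpus) + 1);
}
```

With MEMCG_CHARGE_BATCH = 64 and 128 CPUs, the linear threshold is 8192
update events while the log2 threshold is only 64 * 8 = 512, so readers
flush much sooner on large systems.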
Signed-off-by: Waiman Long <longman@redhat.com>
---
mm/memcontrol.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 772bac21d155..8d4ede72f05c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -548,20 +548,20 @@ struct memcg_vmstats {
* rstat update tree grow unbounded.
*
* 2) Flush the stats synchronously on reader side only when there are more than
- * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
- * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
- * only for 2 seconds due to (1).
+ * (MEMCG_CHARGE_BATCH * (ilog2(nr_cpus) + 1)) update events. Though this
+ * optimization will let stats be out of sync by up to that amount but only
+ * for 2 seconds due to (1).
*/
static void flush_memcg_stats_dwork(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
static u64 flush_last_time;
+static int vmstats_flush_threshold __ro_after_init;
#define FLUSH_TIME (2UL*HZ)
static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
{
- return atomic_read(&vmstats->stats_updates) >
- MEMCG_CHARGE_BATCH * num_online_cpus();
+ return atomic_read(&vmstats->stats_updates) > vmstats_flush_threshold;
}
static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
@@ -5191,6 +5191,13 @@ int __init mem_cgroup_init(void)
memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
SLAB_PANIC | SLAB_HWCACHE_ALIGN);
+ /*
+ * Logarithmically scale up vmstats flush threshold with the number
+ * of CPUs.
+ * N.B. ilog2(1) = 0.
+ */
+ vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
+ (ilog2(num_possible_cpus()) + 1);
return 0;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE
2026-03-19 17:37 [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
2026-03-19 17:37 ` [PATCH 1/7] memcg: Scale up vmstats flush threshold with log2(num_possible_cpus) Waiman Long
@ 2026-03-19 17:37 ` Waiman Long
2026-03-20 11:26 ` Li Wang
2026-03-19 17:37 ` [PATCH 3/7] selftests: memcg: Iterate pages based on the actual page size Waiman Long
` (5 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2026-03-19 17:37 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
Shuah Khan, Mike Rapoport
Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
Sean Christopherson, James Houghton, Sebastian Chlad,
Guopeng Zhang, Li Wang, Waiman Long
For a system with 4k page size, each percpu memcg_stock can hide up
to 256 kbytes of memory with the current MEMCG_CHARGE_BATCH value of
64. For another system with 64k page size, that becomes 4 Mbytes.
This MEMCG_CHARGE_BATCH value also controls how often the memcg
vmstat values need flushing. As a result, the values reported in
various memory cgroup control files become even less indicative of the
actual memory consumption of a particular memory cgroup as the page
size increases from 4k.
This problem can be illustrated by running the test_memcontrol
selftest. Running a 4k page size kernel on a 128-core arm64 system,
the test_memcg_current_peak test, which allocates 50M of anonymous
memory, passed. With a 64k page size kernel on the same system, however,
the same test failed because the "anon" attribute of the memory.stat
file might report a size of 0 depending on the number of CPUs the
system has.
To solve this inaccurate memory stats problem, we need to scale down
the amount of memory that can be hidden by reducing MEMCG_CHARGE_BATCH
when the page size increases. The same user application will likely
consume more memory on systems with a larger page size, and it is also
less efficient if we scale down MEMCG_CHARGE_BATCH by too much. So I
believe a good compromise is to scale down MEMCG_CHARGE_BATCH by a
factor of 2 for 16k page size and by a factor of 4 for 64k page size.
With that change, the test_memcg_current_peak test passed again with
the modified 64k page size kernel.
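A user-space sketch of the proposed macros (charge_batch_sketch() is a hypothetical standalone version; PAGE_SHIFT is 12, 14 and 16 for 4k, 16k and 64k pages respectively):

```c
/* Sketch of MEMCG_CHARGE_BATCH after the change: scale the base batch
 * of 64 down by 2 for 16k pages and by 4 for 64k pages.
 */
static unsigned int charge_batch_sketch(unsigned int page_shift)
{
	unsigned int base = 64u;	/* MEMCG_CHARGE_BATCH_BASE */
	unsigned int shift = (page_shift <= 16) ? (page_shift - 12) / 2 : 2;

	return base >> shift;
}

/* Maximum memory (in bytes) that can be hidden per percpu memcg_stock. */
static unsigned long hidden_bytes_sketch(unsigned int page_shift)
{
	return (unsigned long)charge_batch_sketch(page_shift) << page_shift;
}
```

The per-stock hidden memory then becomes 256k, 512k and 1M for 4k, 16k
and 64k pages, instead of 256k, 1M and 4M with a fixed batch of 64.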
Signed-off-by: Waiman Long <longman@redhat.com>
---
include/linux/memcontrol.h | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 70b685a85bf4..748cfd75d998 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -328,8 +328,14 @@ struct mem_cgroup {
* size of first charge trial.
* TODO: maybe necessary to use big numbers in big irons or dynamic based of the
* workload.
+ *
+ * There are 3 common base page sizes - 4k, 16k & 64k. In order to limit the
+ * amount of memory that can be hidden in each percpu memcg_stock for a given
+ * memcg, we scale down MEMCG_CHARGE_BATCH by 2 for 16k and 4 for 64k.
*/
-#define MEMCG_CHARGE_BATCH 64U
+#define MEMCG_CHARGE_BATCH_BASE 64U
+#define MEMCG_CHARGE_BATCH_SHIFT ((PAGE_SHIFT <= 16) ? (PAGE_SHIFT - 12)/2 : 2)
+#define MEMCG_CHARGE_BATCH (MEMCG_CHARGE_BATCH_BASE >> MEMCG_CHARGE_BATCH_SHIFT)
extern struct mem_cgroup *root_mem_cgroup;
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 3/7] selftests: memcg: Iterate pages based on the actual page size
2026-03-19 17:37 [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
2026-03-19 17:37 ` [PATCH 1/7] memcg: Scale up vmstats flush threshold with log2(num_possible_cpus) Waiman Long
2026-03-19 17:37 ` [PATCH 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE Waiman Long
@ 2026-03-19 17:37 ` Waiman Long
2026-03-20 11:34 ` Li Wang
2026-03-19 17:37 ` [PATCH 4/7] selftests: memcg: Increase error tolerance in accordance with " Waiman Long
` (4 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2026-03-19 17:37 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
Shuah Khan, Mike Rapoport
Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
Sean Christopherson, James Houghton, Sebastian Chlad,
Guopeng Zhang, Li Wang, Waiman Long
The test_memcontrol test currently faults in memory by writing a value
to the start of each page, assuming the default 4k page size.
Micro-optimize it by using the actual system page size to do the
iteration.
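The change can be sketched in user space as follows (touch_pages() is a hypothetical helper mirroring the loop in alloc_anon()):

```c
#include <stdlib.h>
#include <unistd.h>

/* Fault in an anonymous buffer by writing one byte per page; returns
 * the number of writes performed.
 */
static long touch_pages(char *buf, size_t size, long page_size)
{
	long n = 0;
	char *ptr;

	for (ptr = buf; ptr < buf + size; ptr += page_size, n++)
		*ptr = 0;
	return n;
}

/* Query the real page size instead of assuming 4k. */
static long actual_page_size(void)
{
	return sysconf(_SC_PAGE_SIZE);
}
```

On a 64k page kernel, touching a 50M buffer takes 800 writes instead of
the 12,800 writes a hard-coded 4k stride would do.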
Signed-off-by: Waiman Long <longman@redhat.com>
---
tools/testing/selftests/cgroup/test_memcontrol.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index a25eb097b31c..3cc8a432be91 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -25,6 +25,7 @@
static bool has_localevents;
static bool has_recursiveprot;
+static int page_size;
int get_temp_fd(void)
{
@@ -60,7 +61,7 @@ int alloc_anon(const char *cgroup, void *arg)
char *buf, *ptr;
buf = malloc(size);
- for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
+ for (ptr = buf; ptr < buf + size; ptr += page_size)
*ptr = 0;
free(buf);
@@ -183,7 +184,7 @@ static int alloc_anon_50M_check(const char *cgroup, void *arg)
return -1;
}
- for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
+ for (ptr = buf; ptr < buf + size; ptr += page_size)
*ptr = 0;
current = cg_read_long(cgroup, "memory.current");
@@ -413,7 +414,7 @@ static int alloc_anon_noexit(const char *cgroup, void *arg)
return -1;
}
- for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
+ for (ptr = buf; ptr < buf + size; ptr += page_size)
*ptr = 0;
while (getppid() == ppid)
@@ -999,7 +1000,7 @@ static int alloc_anon_50M_check_swap(const char *cgroup, void *arg)
return -1;
}
- for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
+ for (ptr = buf; ptr < buf + size; ptr += page_size)
*ptr = 0;
mem_current = cg_read_long(cgroup, "memory.current");
@@ -1679,6 +1680,7 @@ int main(int argc, char **argv)
char root[PATH_MAX];
int i, proc_status;
+ page_size = sysconf(_SC_PAGE_SIZE);
ksft_print_header();
ksft_set_plan(ARRAY_SIZE(tests));
if (cg_find_unified_root(root, sizeof(root), NULL))
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 4/7] selftests: memcg: Increase error tolerance in accordance with page size
2026-03-19 17:37 [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
` (2 preceding siblings ...)
2026-03-19 17:37 ` [PATCH 3/7] selftests: memcg: Iterate pages based on the actual page size Waiman Long
@ 2026-03-19 17:37 ` Waiman Long
2026-03-19 17:37 ` [PATCH 5/7] selftests: memcg: Reduce the expected swap.peak with larger " Waiman Long
` (3 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2026-03-19 17:37 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
Shuah Khan, Mike Rapoport
Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
Sean Christopherson, James Houghton, Sebastian Chlad,
Guopeng Zhang, Li Wang, Waiman Long
It was found that some of the tests in test_memcontrol can fail more
readily if the system page size is larger than 4k. This is because the
actual memory.current value deviates more from the expected value with
a larger page size. To avoid these failures, the error tolerance is now
increased in accordance with the current system page size. The page
size scale factor is set to 2 for 64k pages and 1 for 16k pages.
Changes are made in alloc_pagecache_max_30M(), test_memcg_protection()
and alloc_anon_50M_check_swap() to increase the error tolerance for
memory.current with larger page sizes. The current set of values is
chosen to ensure that the relevant test_memcontrol tests no longer
show any failure in 100 repeated runs of test_memcontrol with
4k/16k/64k page size kernels on an arm64 system.
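The scale-factor selection can be sketched as follows (pscale_factor_for() is a hypothetical standalone version of the logic added to main()):

```c
/* Page size scale factor used to widen the error tolerances:
 * 0 for 4k, 1 for 16k, 2 for 64k and above.
 */
static int pscale_factor_for(long page_size)
{
	if (page_size <= 4 * 1024)
		return 0;
	return (page_size >= 64 * 1024) ? 2 : 1;
}
```

For example, alloc_anon_50M_check_swap() then allows 6 + pscale_factor
percent of error instead of the previous fixed 3 percent.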
Signed-off-by: Waiman Long <longman@redhat.com>
---
.../cgroup/lib/include/cgroup_util.h | 1 +
.../selftests/cgroup/test_memcontrol.c | 23 ++++++++++++++-----
2 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
index 77f386dab5e8..c25228a78b8b 100644
--- a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
@@ -6,6 +6,7 @@
#define PAGE_SIZE 4096
#endif
+#define KB(x) (x << 10)
#define MB(x) (x << 20)
#define USEC_PER_SEC 1000000L
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 3cc8a432be91..2c3a838536ae 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -26,6 +26,7 @@
static bool has_localevents;
static bool has_recursiveprot;
static int page_size;
+static int pscale_factor; /* Page size scale factor */
int get_temp_fd(void)
{
@@ -571,16 +572,17 @@ static int test_memcg_protection(const char *root, bool min)
if (cg_run(parent[2], alloc_anon, (void *)MB(148)))
goto cleanup;
- if (!values_close(cg_read_long(parent[1], "memory.current"), MB(50), 3))
+ if (!values_close(cg_read_long(parent[1], "memory.current"), MB(50),
+ 3 + (min ? 0 : 4) * pscale_factor))
goto cleanup;
for (i = 0; i < ARRAY_SIZE(children); i++)
c[i] = cg_read_long(children[i], "memory.current");
- if (!values_close(c[0], MB(29), 15))
+ if (!values_close(c[0], MB(29), 15 + 3 * pscale_factor))
goto cleanup;
- if (!values_close(c[1], MB(21), 20))
+ if (!values_close(c[1], MB(21), 20 + pscale_factor))
goto cleanup;
if (c[3] != 0)
@@ -596,7 +598,8 @@ static int test_memcg_protection(const char *root, bool min)
}
current = min ? MB(50) : MB(30);
- if (!values_close(cg_read_long(parent[1], "memory.current"), current, 3))
+ if (!values_close(cg_read_long(parent[1], "memory.current"), current,
+ 9 + (min ? 0 : 6) * pscale_factor))
goto cleanup;
if (!reclaim_until(children[0], MB(10)))
@@ -684,7 +687,7 @@ static int alloc_pagecache_max_30M(const char *cgroup, void *arg)
goto cleanup;
current = cg_read_long(cgroup, "memory.current");
- if (!values_close(current, MB(30), 5))
+ if (!values_close(current, MB(30), 5 + (pscale_factor ? 2 : 0)))
goto cleanup;
ret = 0;
@@ -1004,7 +1007,7 @@ static int alloc_anon_50M_check_swap(const char *cgroup, void *arg)
*ptr = 0;
mem_current = cg_read_long(cgroup, "memory.current");
- if (!mem_current || !values_close(mem_current, mem_max, 3))
+ if (!mem_current || !values_close(mem_current, mem_max, 6 + pscale_factor))
goto cleanup;
swap_current = cg_read_long(cgroup, "memory.swap.current");
@@ -1681,6 +1684,14 @@ int main(int argc, char **argv)
int i, proc_status;
page_size = sysconf(_SC_PAGE_SIZE);
+ /*
+ * It is found that the actual memory.current value can deviate more
+ * from the expected value with larger page size. So error tolerance
+ * will have to be increased a bit more for larger page size.
+ */
+ if (page_size > KB(4))
+ pscale_factor = (page_size >= KB(64)) ? 2 : 1;
+
ksft_print_header();
ksft_set_plan(ARRAY_SIZE(tests));
if (cg_find_unified_root(root, sizeof(root), NULL))
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 5/7] selftests: memcg: Reduce the expected swap.peak with larger page size
2026-03-19 17:37 [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
` (3 preceding siblings ...)
2026-03-19 17:37 ` [PATCH 4/7] selftests: memcg: Increase error tolerance in accordance with " Waiman Long
@ 2026-03-19 17:37 ` Waiman Long
2026-03-19 17:37 ` [PATCH 6/7] selftests: memcg: Don't call reclaim_until() if already in target Waiman Long
` (2 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2026-03-19 17:37 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
Shuah Khan, Mike Rapoport
Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
Sean Christopherson, James Houghton, Sebastian Chlad,
Guopeng Zhang, Li Wang, Waiman Long
When running the test_memcg_swap_max_peak test, which sets swap.max
to 30M, on an arm64 system with 64k page size, the test failed as
swap.peak could only reach up to 27,328,512 bytes (about 26.06 MB,
which is lower than the expected 29M) before the allocating task got
oom-killed.
It is likely due to the fact that it takes longer to write out a larger
page to swap, and hence a lower swap.peak is reached. Setting
memory.high to 29M to throttle memory allocation when nearing memory.max
helps, but it still could only reach up to 29,032,448 bytes (about
27.69M). As a result, we have to reduce the expected swap.peak with
larger page sizes. Now swap.peak is expected to reach only 29M with 4k
pages, 28M with 16k pages and 27M with 64k pages.
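The per-page-size expectation can be sketched as (expected_swap_peak() is a hypothetical standalone version of the selection added to test_memcg_swap_max_peak()):

```c
/* Expected minimum swap.peak by base page size: 29M for 4k,
 * 28M for 16k, 27M for 64k and above.
 */
static long expected_swap_peak(long page_size)
{
	const long mb = 1L << 20;

	if (page_size == 4 * 1024)
		return 29 * mb;
	return (page_size <= 16 * 1024) ? 28 * mb : 27 * mb;
}
```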
Signed-off-by: Waiman Long <longman@redhat.com>
---
.../selftests/cgroup/test_memcontrol.c | 26 ++++++++++++++++---
1 file changed, 22 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 2c3a838536ae..4f12d4b4f9f8 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1032,6 +1032,7 @@ static int test_memcg_swap_max_peak(const char *root)
char *memcg;
long max, peak;
struct stat ss;
+ long swap_peak;
int swap_peak_fd = -1, mem_peak_fd = -1;
/* any non-empty string resets */
@@ -1119,6 +1120,23 @@ static int test_memcg_swap_max_peak(const char *root)
if (cg_write(memcg, "memory.max", "30M"))
goto cleanup;
+ /*
+ * The swap.peak that can be reached will depend on the system page
+ * size. With larger page size (e.g. 64k), it takes more time to write
+ * the anonymous memory page to swap and so the peak reached will be
+ * lower before the memory allocation process gets oom-killed. One way
+ * to allow the swap.peak to go higher is to throttle memory allocation
+ * by setting memory.high to, say, 29M to give more time to swap out the
+ * memory before the oom-kill. This is still not enough to reach the
+ * 29M reachable with a 4k page. So we still need to reduce the expected
+ * swap.peak accordingly.
+ */
+ swap_peak = (page_size == KB(4)) ? MB(29) :
+ ((page_size <= KB(16)) ? MB(28) : MB(27));
+
+ if (cg_write(memcg, "memory.high", "29M"))
+ goto cleanup;
+
/* Should be killed by OOM killer */
if (!cg_run(memcg, alloc_anon, (void *)MB(100)))
goto cleanup;
@@ -1134,7 +1152,7 @@ static int test_memcg_swap_max_peak(const char *root)
goto cleanup;
peak = cg_read_long(memcg, "memory.swap.peak");
- if (peak < MB(29))
+ if (peak < swap_peak)
goto cleanup;
peak = cg_read_long_fd(mem_peak_fd);
@@ -1142,7 +1160,7 @@ static int test_memcg_swap_max_peak(const char *root)
goto cleanup;
peak = cg_read_long_fd(swap_peak_fd);
- if (peak < MB(29))
+ if (peak < swap_peak)
goto cleanup;
/*
@@ -1181,7 +1199,7 @@ static int test_memcg_swap_max_peak(const char *root)
if (cg_read_long(memcg, "memory.peak") < MB(29))
goto cleanup;
- if (cg_read_long(memcg, "memory.swap.peak") < MB(29))
+ if (cg_read_long(memcg, "memory.swap.peak") < swap_peak)
goto cleanup;
if (cg_run(memcg, alloc_anon_50M_check_swap, (void *)MB(30)))
@@ -1196,7 +1214,7 @@ static int test_memcg_swap_max_peak(const char *root)
goto cleanup;
peak = cg_read_long(memcg, "memory.swap.peak");
- if (peak < MB(29))
+ if (peak < swap_peak)
goto cleanup;
peak = cg_read_long_fd(mem_peak_fd);
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 6/7] selftests: memcg: Don't call reclaim_until() if already in target
2026-03-19 17:37 [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
` (4 preceding siblings ...)
2026-03-19 17:37 ` [PATCH 5/7] selftests: memcg: Reduce the expected swap.peak with larger " Waiman Long
@ 2026-03-19 17:37 ` Waiman Long
2026-03-19 17:37 ` [PATCH 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL Waiman Long
2026-03-20 2:43 ` [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Andrew Morton
7 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2026-03-19 17:37 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
Shuah Khan, Mike Rapoport
Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
Sean Christopherson, James Houghton, Sebastian Chlad,
Guopeng Zhang, Li Wang, Waiman Long
Near the end of test_memcg_protection(), reclaim_until() is called
to reduce memory.current of children[0] to 10M. It was found that
with a larger page size (e.g. 64k), the various memory cgroups in
test_memcg_protection() would deviate further from the expected values,
especially for the test_memcg_low test. As a result, children[0] might
have already reached the target without any reclamation. This will
cause the reclaim_until() function to report failure as no reclamation
is needed.
Avoid this unexpected failure by skipping the reclaim_until() call if
memory.current of children[0] has already reached the target size on
kernels with a non-4k page size.
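The skip condition can be sketched as follows (values_close_sketch() assumes the cgroup_util.c definition of values_close(), i.e. the difference is within err percent of the sum of the two values; need_reclaim() is a hypothetical standalone version of the do_reclaim logic):

```c
#include <stdlib.h>

/* Assumed semantics of values_close() from cgroup_util.c. */
static int values_close_sketch(long a, long b, int err)
{
	return labs(a - b) <= (a + b) / 100 * err;
}

/* Only call reclaim_until() when there is actually something to
 * reclaim: 4k kernels always do; larger page sizes skip it when
 * memory.current is already at or close to the 10M target.
 */
static int need_reclaim(long page_size, long current_bytes, long target)
{
	if (page_size == 4 * 1024)
		return 1;
	return current_bytes > target &&
	       !values_close_sketch(current_bytes, target, 3);
}
```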
Signed-off-by: Waiman Long <longman@redhat.com>
---
tools/testing/selftests/cgroup/test_memcontrol.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 4f12d4b4f9f8..0ef09bafa68c 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -490,6 +490,7 @@ static int test_memcg_protection(const char *root, bool min)
long current;
int i, attempts;
int fd;
+ bool do_reclaim;
fd = get_temp_fd();
if (fd < 0)
@@ -602,7 +603,15 @@ static int test_memcg_protection(const char *root, bool min)
9 + (min ? 0 : 6) * pscale_factor))
goto cleanup;
- if (!reclaim_until(children[0], MB(10)))
+ /*
+ * With larger page size, it is possible that memory.current of
+ * children[0] is close to 10M. Skip the reclaim_until() call if
+ * that is the case.
+ */
+ current = cg_read_long(children[0], "memory.current");
+ do_reclaim = (page_size == KB(4)) ||
+ ((current > MB(10)) && !values_close(current, MB(10), 3));
+ if (do_reclaim && !reclaim_until(children[0], MB(10)))
goto cleanup;
if (min) {
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL
2026-03-19 17:37 [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
` (5 preceding siblings ...)
2026-03-19 17:37 ` [PATCH 6/7] selftests: memcg: Don't call reclaim_until() if already in target Waiman Long
@ 2026-03-19 17:37 ` Waiman Long
2026-03-20 2:43 ` [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Andrew Morton
7 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2026-03-19 17:37 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
Shuah Khan, Mike Rapoport
Cc: linux-kernel, cgroups, linux-mm, linux-kselftest,
Sean Christopherson, James Houghton, Sebastian Chlad,
Guopeng Zhang, Li Wang, Waiman Long
Although there is supposed to be a periodic and asynchronous flush of
stats every 2 seconds, the actual time lag between successive runs can
vary quite a bit. In fact, I have seen time lags of up to tens of
seconds in some cases.
At the end of test_memcg_sock, the test waits up to 3 seconds for the
"sock" attribute of memory.stat to go back down to 0. Obviously it
may occasionally fail, especially when the kernel has a large page size
(e.g. 64k). Treat this failure as an expected failure (XFAIL) to
distinguish it from the other failure cases.
Signed-off-by: Waiman Long <longman@redhat.com>
---
tools/testing/selftests/cgroup/test_memcontrol.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 0ef09bafa68c..9206da51ac83 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1486,12 +1486,20 @@ static int test_memcg_sock(const char *root)
* Poll memory.stat for up to 3 seconds (~FLUSH_TIME plus some
* scheduling slack) and require that the "sock " counter
* eventually drops to zero.
+ *
+ * The actual elapsed time between consecutive runs of the
+ * asynchronous memcg rstat flush may vary quite a bit. So the
+ * 3-second wait time may not be enough for the "sock" counter
+ * to go down to 0. Treat it as an XFAIL instead of a FAIL.
*/
sock_post = cg_read_key_long_poll(memcg, "memory.stat", "sock ", 0,
MEMCG_SOCKSTAT_WAIT_RETRIES,
DEFAULT_WAIT_INTERVAL_US);
- if (sock_post)
+ if (sock_post) {
+ ret = KSFT_XFAIL;
goto cleanup;
+ }
ret = KSFT_PASS;
@@ -1753,6 +1761,9 @@ int main(int argc, char **argv)
case KSFT_SKIP:
ksft_test_result_skip("%s\n", tests[i].name);
break;
+ case KSFT_XFAIL:
+ ksft_test_result_xfail("%s\n", tests[i].name);
+ break;
default:
ksft_test_result_fail("%s\n", tests[i].name);
break;
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes
2026-03-19 17:37 [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
` (6 preceding siblings ...)
2026-03-19 17:37 ` [PATCH 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL Waiman Long
@ 2026-03-20 2:43 ` Andrew Morton
2026-03-20 15:56 ` Waiman Long
7 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2026-03-20 2:43 UTC (permalink / raw)
To: Waiman Long
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Tejun Heo, Michal Koutný, Shuah Khan,
Mike Rapoport, linux-kernel, cgroups, linux-mm, linux-kselftest,
Sean Christopherson, James Houghton, Sebastian Chlad,
Guopeng Zhang, Li Wang
On Thu, 19 Mar 2026 13:37:45 -0400 Waiman Long <longman@redhat.com> wrote:
> There are a number of test failures when running the test_memcontrol
> selftest on a 128-core arm64 system on kernels with 4k/16k/64k page
> sizes. This patch series makes some minor changes to the kernel and
> the test_memcontrol selftest to address these failures.
>
> The first kernel patch scales the memcg vmstats flush threshold
> logarithmically instead of linearly with the total number of CPUs. The
> second kernel patch scales down MEMCG_CHARGE_BATCH as the page size
> increases. These 2 patches help to reduce the discrepancies between the
> reported usage data and the real ones.
>
> The next 5 test_memcontrol selftest patches adjust the testing code to
> greatly reduce the chance that it will report failure, though some
> occasional failures are still possible.
>
> To verify the changes, the test_memcontrol selftest was run 100
> times each on a 128-core arm64 system on kernels with 4k/16k/64k
> page sizes. No failure was observed other than some failures of the
> test_memcg_reclaim test when running on a 16k page size kernel. The
> reclaim_until() call failed because of an unexpected over-reclaim of
> memory. This will need a further look, but it happens only with the 16k
> page size kernel and I don't have a production-ready kernel config file
> to use in building this 16k page size kernel. The new test_memcontrol
> selftest and kernel were also run on a 96-core x86 system to make sure
> there was no regression.
AI reviewbot asks questions:
https://sashiko.dev/#/patchset/20260319173752.1472864-1-longman%40redhat.com
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/7] memcg: Scale up vmstats flush threshold with log2(num_possible_cpus)
2026-03-19 17:37 ` [PATCH 1/7] memcg: Scale up vmstats flush threshold with log2(num_possible_cpus) Waiman Long
@ 2026-03-20 10:40 ` Li Wang
2026-03-20 13:19 ` Waiman Long
0 siblings, 1 reply; 16+ messages in thread
From: Li Wang @ 2026-03-20 10:40 UTC (permalink / raw)
To: Waiman Long
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
linux-kselftest, Sean Christopherson, James Houghton,
Sebastian Chlad, Guopeng Zhang, Li Wang
On Thu, Mar 19, 2026 at 01:37:46PM -0400, Waiman Long wrote:
> The vmstats flush threshold currently increases linearly with the
> number of online CPUs. As the number of CPUs increases over time, it
> will become increasingly difficult to meet the threshold and update the
> vmstats data in a timely manner. These days, systems with hundreds
> or even thousands of CPUs are becoming more common.
>
> For example, the test_memcg_sock test of test_memcontrol always fails
> when running on an arm64 system with 128 CPUs. This is because the
> threshold is now 64*128 = 8192. With a 4k page size, it needs changes in
> 32 MB of memory. It will be even worse with larger page sizes like 64k.
>
> To make the output of memory.stat more accurate, it is better to
> scale up the threshold logarithmically instead of linearly with the
> number of CPUs. With the log2 scale, we can use the possibly larger
> num_possible_cpus() instead of num_online_cpus() which may change at
> run time.
>
> Although there is supposed to be a periodic and asynchronous flush of
> vmstats every 2 seconds, the actual time lag between successive runs
> can vary quite a bit. In fact, I have seen time lags of up
> to 10s of seconds in some cases. So we can't rely too much on the hope
> that there will be an asynchronous vmstats flush every 2 seconds. This
> may be something we need to look into.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> mm/memcontrol.c | 17 ++++++++++++-----
> 1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 772bac21d155..8d4ede72f05c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -548,20 +548,20 @@ struct memcg_vmstats {
> * rstat update tree grow unbounded.
> *
> * 2) Flush the stats synchronously on reader side only when there are more than
> - * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
> - * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
> - * only for 2 seconds due to (1).
> + * (MEMCG_CHARGE_BATCH * (ilog2(nr_cpus) + 1)) update events. Though this
> + * optimization will let stats be out of sync by up to that amount but only
> + * for 2 seconds due to (1).
> */
> static void flush_memcg_stats_dwork(struct work_struct *w);
> static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> static u64 flush_last_time;
> +static int vmstats_flush_threshold __ro_after_init;
>
> #define FLUSH_TIME (2UL*HZ)
>
> static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
> {
> - return atomic_read(&vmstats->stats_updates) >
> - MEMCG_CHARGE_BATCH * num_online_cpus();
> + return atomic_read(&vmstats->stats_updates) > vmstats_flush_threshold;
> }
>
> static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
> @@ -5191,6 +5191,13 @@ int __init mem_cgroup_init(void)
>
> memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
> SLAB_PANIC | SLAB_HWCACHE_ALIGN);
> + /*
> + * Logarithmically scale up vmstats flush threshold with the number
> + * of CPUs.
> + * N.B. ilog2(1) = 0.
> + */
> + vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
> + (ilog2(num_possible_cpus()) + 1);
Changing the threshold scaling from linear to logarithmic looks smarter,
but my concern is that, on large systems (hundreds/thousands of CPUs),
the threshold drops dramatically.
For example, with 1024 CPUs it goes from 65536 (256MB) to only 704 (2.7MB),
that's almost 100x. Could this potentially raise a performance issue
when 'memory.stat' is read frequently on a heavily loaded system?
Maybe go with MEMCG_CHARGE_BATCH * int_sqrt(num_possible_cpus()),
which sits between linear and log2?
--
Regards,
Li Wang
* Re: [PATCH 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE
2026-03-19 17:37 ` [PATCH 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE Waiman Long
@ 2026-03-20 11:26 ` Li Wang
2026-03-20 13:20 ` Waiman Long
0 siblings, 1 reply; 16+ messages in thread
From: Li Wang @ 2026-03-20 11:26 UTC (permalink / raw)
To: Waiman Long
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
linux-kselftest, Sean Christopherson, James Houghton,
Sebastian Chlad, Guopeng Zhang, Li Wang
Waiman Long wrote:
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -328,8 +328,14 @@ struct mem_cgroup {
> * size of first charge trial.
> * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
> * workload.
> + *
> + * There are 3 common base page sizes - 4k, 16k & 64k. In order to limit the
> + * amount of memory that can be hidden in each percpu memcg_stock for a given
> + * memcg, we scale down MEMCG_CHARGE_BATCH by 2 for 16k and 4 for 64k.
> */
> -#define MEMCG_CHARGE_BATCH 64U
> +#define MEMCG_CHARGE_BATCH_BASE 64U
> +#define MEMCG_CHARGE_BATCH_SHIFT ((PAGE_SHIFT <= 16) ? (PAGE_SHIFT - 12)/2 : 2)
> +#define MEMCG_CHARGE_BATCH (MEMCG_CHARGE_BATCH_BASE >> MEMCG_CHARGE_BATCH_SHIFT)
This is a good complement to the first patch. With this change,
I put together a chart comparing the three methods (linear, log2, sqrt)
of computing the threshold:
4k page size (BATCH=64):
CPUs linear log2 sqrt
--------------------------------
1 256KB 256KB 256KB
8 2MB 1MB 512KB
128 32MB 2MB 2.75MB
1024 256MB 2.75MB 8MB
64k page size (BATCH=16):
CPUs linear log2 sqrt
-------------------------------
1 1MB 1MB 1MB
8 8MB 4MB 2MB
128 128MB 8MB 11MB
1024 1GB 11MB 32MB
Both are huge improvements.
log2 flushes more aggressively on large systems, which gives more accurate
stats but at the cost of more frequent synchronous flushes.
sqrt is more conservative, still a massive reduction from linear but gives
more breathing room on large systems, which may be better for performance.
I would leave this choice to you, Waiman, and the data is for reference.
--
Regards,
Li Wang
* Re: [PATCH 3/7] selftests: memcg: Iterate pages based on the actual page size
2026-03-19 17:37 ` [PATCH 3/7] selftests: memcg: Iterate pages based on the actual page size Waiman Long
@ 2026-03-20 11:34 ` Li Wang
0 siblings, 0 replies; 16+ messages in thread
From: Li Wang @ 2026-03-20 11:34 UTC (permalink / raw)
To: Waiman Long
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
linux-kselftest, Sean Christopherson, James Houghton,
Sebastian Chlad, Guopeng Zhang, Li Wang
On Thu, Mar 19, 2026 at 01:37:48PM -0400, Waiman Long wrote:
> The current test_memcontrol test faults in memory by writing a value
> to the start of each page, assuming the default 4k page size.
> Micro-optimize it by using the actual system page size to do the
> iteration.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Li Wang <liwang@redhat.com>
--
Regards,
Li Wang
* Re: [PATCH 1/7] memcg: Scale up vmstats flush threshold with log2(nums_possible_cpus)
2026-03-20 10:40 ` Li Wang
@ 2026-03-20 13:19 ` Waiman Long
0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2026-03-20 13:19 UTC (permalink / raw)
To: Li Wang
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
linux-kselftest, Sean Christopherson, James Houghton,
Sebastian Chlad, Guopeng Zhang, Li Wang
On 3/20/26 6:40 AM, Li Wang wrote:
> On Thu, Mar 19, 2026 at 01:37:46PM -0400, Waiman Long wrote:
>> The vmstats flush threshold currently increases linearly with the
>> number of online CPUs. As the number of CPUs increases over time, it
>> will become increasingly difficult to meet the threshold and update the
>> vmstats data in a timely manner. These days, systems with hundreds of
>> CPUs or even thousands of them are becoming more common.
>>
>> For example, the test_memcg_sock test of test_memcontrol always fails
>> when running on an arm64 system with 128 CPUs. This is because the
>> threshold is now 64*128 = 8192. With a 4k page size, it needs changes in
>> 32 MB of memory. It will be even worse with larger page sizes like 64k.
>>
>> To make the output of memory.stat more accurate, it is better to
>> scale up the threshold logarithmically instead of linearly with the
>> number of CPUs. With the log2 scale, we can use the possibly larger
>> num_possible_cpus() instead of num_online_cpus() which may change at
>> run time.
>>
>> Although there is supposed to be a periodic and asynchronous flush of
>> vmstats every 2 seconds, the actual time lag between successive runs
>> can vary quite a bit. In fact, I have seen time lags of up
>> to 10s of seconds in some cases. So we can't rely too much on the hope
>> that there will be an asynchronous vmstats flush every 2 seconds. This
>> may be something we need to look into.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>> mm/memcontrol.c | 17 ++++++++++++-----
>> 1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 772bac21d155..8d4ede72f05c 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -548,20 +548,20 @@ struct memcg_vmstats {
>> * rstat update tree grow unbounded.
>> *
>> * 2) Flush the stats synchronously on reader side only when there are more than
>> - * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
>> - * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
>> - * only for 2 seconds due to (1).
>> + * (MEMCG_CHARGE_BATCH * (ilog2(nr_cpus) + 1)) update events. Though this
>> + * optimization will let stats be out of sync by up to that amount but only
>> + * for 2 seconds due to (1).
>> */
>> static void flush_memcg_stats_dwork(struct work_struct *w);
>> static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>> static u64 flush_last_time;
>> +static int vmstats_flush_threshold __ro_after_init;
>>
>> #define FLUSH_TIME (2UL*HZ)
>>
>> static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
>> {
>> - return atomic_read(&vmstats->stats_updates) >
>> - MEMCG_CHARGE_BATCH * num_online_cpus();
>> + return atomic_read(&vmstats->stats_updates) > vmstats_flush_threshold;
>> }
>>
>> static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
>> @@ -5191,6 +5191,13 @@ int __init mem_cgroup_init(void)
>>
>> memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
>> SLAB_PANIC | SLAB_HWCACHE_ALIGN);
>> + /*
>> + * Logarithmically scale up vmstats flush threshold with the number
>> + * of CPUs.
>> + * N.B. ilog2(1) = 0.
>> + */
>> + vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
>> + (ilog2(num_possible_cpus()) + 1);
> Changing the threshold scaling from linear to logarithmic looks smarter,
> but my concern is that, on large systems (hundreds/thousands of CPUs),
> the threshold drops dramatically.
>
> For example, with 1024 CPUs it goes from 65536 (256MB) to only 704 (2.7MB),
> that's almost 100x. Could this potentially raise a performance issue
> when 'memory.stat' is read frequently on a heavily loaded system?
>
> Maybe go with MEMCG_CHARGE_BATCH * int_sqrt(num_possible_cpus()),
> which sits between linear and log2?
I have also been thinking about scaling faster than log2 but still below
linear. I believe int_sqrt() is a good suggestion and I will adopt it in
the next version.
Thanks,
Longman
* Re: [PATCH 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE
2026-03-20 11:26 ` Li Wang
@ 2026-03-20 13:20 ` Waiman Long
0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2026-03-20 13:20 UTC (permalink / raw)
To: Li Wang
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Tejun Heo, Michal Koutný,
Shuah Khan, Mike Rapoport, linux-kernel, cgroups, linux-mm,
linux-kselftest, Sean Christopherson, James Houghton,
Sebastian Chlad, Guopeng Zhang, Li Wang
On 3/20/26 7:26 AM, Li Wang wrote:
> Waiman Long wrote:
>
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -328,8 +328,14 @@ struct mem_cgroup {
>> * size of first charge trial.
>> * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
>> * workload.
>> + *
>> + * There are 3 common base page sizes - 4k, 16k & 64k. In order to limit the
>> + * amount of memory that can be hidden in each percpu memcg_stock for a given
>> + * memcg, we scale down MEMCG_CHARGE_BATCH by 2 for 16k and 4 for 64k.
>> */
>> -#define MEMCG_CHARGE_BATCH 64U
>> +#define MEMCG_CHARGE_BATCH_BASE 64U
>> +#define MEMCG_CHARGE_BATCH_SHIFT ((PAGE_SHIFT <= 16) ? (PAGE_SHIFT - 12)/2 : 2)
>> +#define MEMCG_CHARGE_BATCH (MEMCG_CHARGE_BATCH_BASE >> MEMCG_CHARGE_BATCH_SHIFT)
> This is a good complement to the first patch. With this change,
> I put together a chart comparing the three methods (linear, log2, sqrt)
> of computing the threshold:
>
> 4k page size (BATCH=64):
>
> CPUs linear log2 sqrt
> --------------------------------
> 1 256KB 256KB 256KB
> 8 2MB 1MB 512KB
> 128 32MB 2MB 2.75MB
> 1024 256MB 2.75MB 8MB
>
> 64k page size (BATCH=16):
>
> CPUs linear log2 sqrt
> -------------------------------
> 1 1MB 1MB 1MB
> 8 8MB 4MB 2MB
> 128 128MB 8MB 11MB
> 1024 1GB 11MB 32MB
>
>
> Both are huge improvements.
>
> log2 flushes more aggressively on large systems, which gives more accurate
> stats but at the cost of more frequent synchronous flushes.
>
> sqrt is more conservative, still a massive reduction from linear but gives
> more breathing room on large systems, which may be better for performance.
>
> I would leave this choice to you, Waiman, and the data is for reference.
>
I think it is a good idea to use the int_sqrt() function and I will use
it in the next version.
Cheers,
Longman
* Re: [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes
2026-03-20 2:43 ` [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Andrew Morton
@ 2026-03-20 15:56 ` Waiman Long
2026-03-20 20:26 ` Waiman Long
0 siblings, 1 reply; 16+ messages in thread
From: Waiman Long @ 2026-03-20 15:56 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Tejun Heo, Michal Koutný, Shuah Khan,
Mike Rapoport, linux-kernel, cgroups, linux-mm, linux-kselftest,
Sean Christopherson, James Houghton, Sebastian Chlad,
Guopeng Zhang, Li Wang
On 3/19/26 10:43 PM, Andrew Morton wrote:
> On Thu, 19 Mar 2026 13:37:45 -0400 Waiman Long <longman@redhat.com> wrote:
>
>> There are a number of test failures with the running of the
>> test_memcontrol selftest on a 128-core arm64 system on kernels with
>> 4k/16k/64k page sizes. This patch series makes some minor changes to
>> the kernel and the test_memcontrol selftest to address these failures.
>>
>> The first kernel patch scales the memcg vmstats flush threshold
>> logarithmically instead of linearly with the total number of CPUs. The
>> second kernel patch scales down MEMCG_CHARGE_BATCH with increases in page
>> size. These 2 patches help to reduce the discrepancies between the
>> reported usage data and the real values.
>>
>> The next 5 test_memcontrol selftest patches adjust the testing code to
>> greatly reduce the chance that it will report failure, though some
>> occasional failures are still possible.
>>
>> To verify the changes, the test_memcontrol selftest was run 100
>> times each on a 128-core arm64 system on kernels with 4k/16k/64k
>> page sizes. No failure was observed other than some failures of the
>> test_memcg_reclaim test when running on a 16k page size kernel. The
>> reclaim_until() call failed because of the unexpected over-reclaim of
>> memory. This will need a further look but it happens with the 16k page
>> size kernel only and I don't have a production-ready kernel config file
>> to use in building this 16k page size kernel. The new test_memcontrol
>> selftest and kernel were also run on a 96-core x86 system to make sure
>> there was no regression.
> AI reviewbot asks questions:
> https://sashiko.dev/#/patchset/20260319173752.1472864-1-longman%40redhat.com
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -548,20 +548,20 @@ struct memcg_vmstats {
>> * rstat update tree grow unbounded.
>> *
>> * 2) Flush the stats synchronously on reader side only when there are more than
>> - * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
>> - * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
>> - * only for 2 seconds due to (1).
>> + * (MEMCG_CHARGE_BATCH * (ilog2(nr_cpus) + 1)) update events. Though this
>> + * optimization will let stats be out of sync by up to that amount but only
>> + * for 2 seconds due to (1).
> Is this description accurate regarding the maximum out of sync amount?
> Looking at memcg_rstat_updated(), updates are buffered locally on each CPU
> up to MEMCG_CHARGE_BATCH - 1 before they are added to the global
> vmstats->stats_updates counter.
> Because memcg_vmstats_needs_flush() only checks the global counter, could
> N CPUs each buffer MEMCG_CHARGE_BATCH - 1 updates without triggering a
> synchronous flush?
> If so, wouldn't the actual worst-case out-of-sync error be
> N * (MEMCG_CHARGE_BATCH - 1) + vmstats_flush_threshold, which remains
> linear with the number of CPUs rather than scaling logarithmically?
Good point, the worst-case scenario can indeed be worse than what the
comment states. I will update the comment accordingly.
>> @@ -328,8 +328,14 @@ struct mem_cgroup {
>> * size of first charge trial.
>> * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
>> * workload.
>> + *
>> + * There are 3 common base page sizes - 4k, 16k & 64k. In order to limit the
>> + * amount of memory that can be hidden in each percpu memcg_stock for a given
>> + * memcg, we scale down MEMCG_CHARGE_BATCH by 2 for 16k and 4 for 64k.
>> */
>> -#define MEMCG_CHARGE_BATCH 64U
>> +#define MEMCG_CHARGE_BATCH_BASE 64U
>> +#define MEMCG_CHARGE_BATCH_SHIFT ((PAGE_SHIFT <= 16) ? (PAGE_SHIFT - 12)/2 : 2)
>> +#define MEMCG_CHARGE_BATCH (MEMCG_CHARGE_BATCH_BASE >> MEMCG_CHARGE_BATCH_SHIFT)
> Will this cause false failures in the cgroup selftests on systems with
> larger page sizes?
Exactly. Li Wang had posted a patch [1] to address this issue. If that
patch is adopted, we will have to update this code again to reduce the
expected MAX_VMSTAT_ERROR accordingly.
[1] https://lore.kernel.org/lkml/20260306071843.149147-1-liwang@redhat.com/
> In tools/testing/selftests/cgroup/test_kmem.c, MAX_VMSTAT_ERROR is hardcoded
> assuming a 4KB page size and a 64-page charge batch (256KB per CPU):
> #define MAX_VMSTAT_ERROR (4096 * 64 * get_nprocs())
> ...
> if (labs(sum - current) < MAX_VMSTAT_ERROR) {
> ...
> With this change, a 64KB page system will have a batch size of 16. The
> theoretical per-CPU error becomes 16 pages * 64KB = 1024KB per CPU, which
> is 4 times larger than the hardcoded MAX_VMSTAT_ERROR.
> Does MAX_VMSTAT_ERROR in test_kmem.c need to be updated to compute the error
> using the runtime system page size and the adjusted batch limit?
>> diff --git a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
>> index 77f386dab5e8..c25228a78b8b 100644
>> --- a/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
>> +++ b/tools/testing/selftests/cgroup/lib/include/cgroup_util.h
>> @@ -6,6 +6,7 @@
>> #define PAGE_SIZE 4096
>> #endif
>>
>> +#define KB(x) (x << 10)
> This isn't a bug, but could this macro definition risk operator precedence
> issues if invoked with an expression? Should it be defined with inner
> parentheses around the argument, like #define KB(x) ((x) << 10)?
So far, only a single integer value is used for x, so the current macro
is still OK. Yes, I agree that adding parentheses will be safer for
future use cases.
>> @@ -1181,7 +1199,7 @@ static int test_memcg_swap_max_peak(const char *root)
>> if (cg_read_long(memcg, "memory.peak") < MB(29))
>> goto cleanup;
>>
>> - if (cg_read_long(memcg, "memory.swap.peak") < MB(29))
>> + if (cg_read_long(memcg, "memory.swap.peak") < swap_peak)
>> goto cleanup;
>>
>> if (cg_run(memcg, alloc_anon_50M_check_swap, (void *)MB(30)))
>> goto cleanup;
> Does leaving memory.high set to 29M alter the semantics of the subsequent
> alloc_anon_50M_check_swap test?
> When alloc_anon_50M_check_swap runs, it allocates 50MB and expects
> memory.current to be close to the 30MB memory.max limit.
> Because memory.high is still set to 29MB from the earlier operation, the 50MB
> allocation gets throttled at 29MB and enters synchronous reclaim before
> reaching the 30MB memory.max limit.
> Could this inadvertently change the second test from verifying swap behavior
> under a hard limit to verifying it under a soft limit?
The purpose of setting memory.high to 29M is to slow down the memory
allocation process in order to enable the swapping code to swap out
more anonymous memory before the OOM killer comes in and kills the
process. Otherwise, the actual swap-out value will be even lower with
a larger page size. I can drop setting memory.high and set the threshold
even lower in order to avoid expected failures.
>> @@ -1477,12 +1477,20 @@ static int test_memcg_sock(const char *root)
>> * Poll memory.stat for up to 3 seconds (~FLUSH_TIME plus some
>> * scheduling slack) and require that the "sock " counter
>> * eventually drops to zero.
>> + *
>> + * The actual run-to-run elapsed time between consecutive runs
>> + * of the asynchronous memcg rstat flush may vary quite a bit.
>> + * So the 3 seconds wait time may not be enough for the "sock"
>> + * counter to go down to 0. Treat it as an XFAIL instead of
>> + * a FAIL.
>> */
>> sock_post = cg_read_key_long_poll(memcg, "memory.stat", "sock ", 0,
>> MEMCG_SOCKSTAT_WAIT_RETRIES,
>> DEFAULT_WAIT_INTERVAL_US);
>> - if (sock_post)
>> + if (sock_post) {
>> + ret = KSFT_XFAIL;
>> goto cleanup;
>> + }
> Does this code inadvertently mask actual system or I/O errors?
> If cg_read_key_long_poll() returns -1 because it failed to read the file or
> the key is missing, this check will treat it as an expected timeout failure
> rather than an actual test failure.
I will add a positive value check before setting KSFT_XFAIL.
> Does marking this condition as KSFT_XFAIL prevent the test from catching
> genuine socket memory leaks?
> If a kernel regression causes socket memory to actually leak, the test will
> time out and report an expected failure, which CI systems might ignore.
> Would it be more robust to increase the polling timeout to accommodate the
> maximum latency observed, or manually trigger a synchronous flush, instead
> of masking the timeout?
We may have to increase the timeout excessively in order to allow for
the possible variations of the asynchronous vmstats flush delay. That may
make the test take too long to run. In my own testing, the current code
fails rather frequently without this change.
I do suggest that we look into this issue; we can remove this expected
failure once the issue is fixed.
Cheers,
Longman
* Re: [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes
2026-03-20 15:56 ` Waiman Long
@ 2026-03-20 20:26 ` Waiman Long
0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2026-03-20 20:26 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Tejun Heo, Michal Koutný, Shuah Khan,
Mike Rapoport, linux-kernel, cgroups, linux-mm, linux-kselftest,
Sean Christopherson, James Houghton, Sebastian Chlad,
Guopeng Zhang, Li Wang
On 3/20/26 11:56 AM, Waiman Long wrote:
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -548,20 +548,20 @@ struct memcg_vmstats {
> >> * rstat update tree grow unbounded.
> >> *
> >> * 2) Flush the stats synchronously on reader side only when there are more than
> >> - * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
> >> - * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
> >> - * only for 2 seconds due to (1).
> >> + * (MEMCG_CHARGE_BATCH * (ilog2(nr_cpus) + 1)) update events. Though this
> >> + * optimization will let stats be out of sync by up to that amount but only
> >> + * for 2 seconds due to (1).
> > Is this description accurate regarding the maximum out of sync amount?
> > Looking at memcg_rstat_updated(), updates are buffered locally on each CPU
> > up to MEMCG_CHARGE_BATCH - 1 before they are added to the global
> > vmstats->stats_updates counter.
> > Because memcg_vmstats_needs_flush() only checks the global counter, could
> > N CPUs each buffer MEMCG_CHARGE_BATCH - 1 updates without triggering a
> > synchronous flush?
> > If so, wouldn't the actual worst-case out-of-sync error be
> > N * (MEMCG_CHARGE_BATCH - 1) + vmstats_flush_threshold, which remains
> > linear with the number of CPUs rather than scaling logarithmically?
>
> Good point, the worst-case scenario can indeed be worse than what the
> comment states. I will update the comment accordingly.
Looking at the code again, the hidden charge in memcg_stock should only
affect memory.current, not memory.stat. There is nothing to add to the
worst-case situation.
Cheers,
Longman
end of thread, other threads:[~2026-03-20 20:26 UTC | newest]
Thread overview: 16+ messages
-- links below jump to the message on this page --
2026-03-19 17:37 [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
2026-03-19 17:37 ` [PATCH 1/7] memcg: Scale up vmstats flush threshold with log2(nums_possible_cpus) Waiman Long
2026-03-20 10:40 ` Li Wang
2026-03-20 13:19 ` Waiman Long
2026-03-19 17:37 ` [PATCH 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE Waiman Long
2026-03-20 11:26 ` Li Wang
2026-03-20 13:20 ` Waiman Long
2026-03-19 17:37 ` [PATCH 3/7] selftests: memcg: Iterate pages based on the actual page size Waiman Long
2026-03-20 11:34 ` Li Wang
2026-03-19 17:37 ` [PATCH 4/7] selftests: memcg: Increase error tolerance in accordance with " Waiman Long
2026-03-19 17:37 ` [PATCH 5/7] selftests: memcg: Reduce the expected swap.peak with larger " Waiman Long
2026-03-19 17:37 ` [PATCH 6/7] selftests: memcg: Don't call reclaim_until() if already in target Waiman Long
2026-03-19 17:37 ` [PATCH 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL Waiman Long
2026-03-20 2:43 ` [PATCH 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Andrew Morton
2026-03-20 15:56 ` Waiman Long
2026-03-20 20:26 ` Waiman Long
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox